csv-reconcile's Introduction

CSV Reconcile

A reconciliation service for OpenRefine based on a CSV file similar to reconcile-csv. This one is written in Python and has some more configurability.

Quick start

  • Clone this repository
  • Run the service
    $ python -m venv venv                                             # create virtualenv
    $ venv/bin/pip install csv-reconcile                              # install package
    $ source venv/bin/activate                                        # activate virtual environment
    (venv) $ csv-reconcile init sample/reps.tsv item itemLabel        # initialize the service
    (venv) $ csv-reconcile serve                                      # run the service
    (venv) $ deactivate                                               # leave the virtual environment
        

The service is run at http://127.0.0.1:5000/reconcile. You can point at a different host:port by adding SERVER_NAME to the sample.cfg. Since this is running from a virtualenv, you can simply delete the whole lot to clean up.
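Once the service is up, it can be handy to query it by hand, outside OpenRefine. The following is a minimal sketch, assuming the service accepts the form-encoded `queries` parameter described in the Reconciliation Service API; the helper names `build_reconcile_body` and `reconcile` are made up for this example and are not part of csv-reconcile.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_reconcile_body(names, limit=3):
    """Encode a batch of names as a form-encoded `queries` parameter."""
    queries = {
        f"q{i}": {"query": name, "limit": limit}
        for i, name in enumerate(names)
    }
    return urlencode({"queries": json.dumps(queries)}).encode("ascii")


def reconcile(endpoint, names, limit=3):
    """POST the batch to a running service and return the decoded JSON."""
    request = Request(endpoint, data=build_reconcile_body(names, limit))
    with urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only works while `csv-reconcile serve` is running locally.
    print(reconcile("http://127.0.0.1:5000/reconcile", ["John Doe"]))
```

Keeping the body builder separate from the network call makes it possible to inspect the request without a running service.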

If you have a C compiler installed you may prefer to install the sdist dist/csv-reconcile-0.1.0.tar.gz which will build a Cython version of the computationally intensive fuzzy match routine for speed. With pip add the option --no-binary csv-reconcile.

Poetry

Prerequisites

You’ll need to have both poetry and poethepoet installed. For publishing to PyPI pandoc is required.

Running

This is packaged with poetry, so you can use those commands if you have it installed.

$ poe install
$ poetry run csv-reconcile init sample/reps.tsv item itemLabel
$ poetry run csv-reconcile serve

Building

Because this package uses a README.org file and pip requires a README.md, there are extra build steps beyond what poetry supplies. These are managed using poethepoet. Thus building is done as follows:

$ poe build

If you want to build a platform agnostic wheel, you’ll have to comment out the build = "build.py" line from pyproject.toml until poetry supports selecting build platform.

Description

This reconciliation service uses Dice coefficient scoring to reconcile values against a given column in a CSV file. The CSV file must contain a column containing distinct values to reconcile to. We’ll call this the id column. We’ll call the column being reconciled against the name column.

For performance reasons, the name column is preprocessed to normalized values which are stored in an sqlite database. This database must be initialized at least once by running the init sub-command. Once initialized this need not be run for subsequent runs.
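As a rough illustration of this preprocessing step, the sketch below loads a delimited file into an sqlite table of normalized names. It is not the project's code: the table name `reconcile` is borrowed from an error message quoted in the issues, while the column layout, the `normalize` function and the `init_db` signature are assumptions made for the example.

```python
import csv
import io
import sqlite3


def normalize(word):
    """Hypothetical normalization: lowercase and collapse whitespace."""
    return " ".join(word.lower().split())


def init_db(conn, csvfile, idcol, namecol, **csvkwargs):
    """Store each row's id, name and normalized name in sqlite."""
    reader = csv.reader(csvfile, **csvkwargs)
    header = next(reader)
    ididx = header.index(idcol)      # raises ValueError if the column is missing
    nameidx = header.index(namecol)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE reconcile (id TEXT PRIMARY KEY, word TEXT, normalized TEXT)"
        )
        conn.executemany(
            "INSERT INTO reconcile VALUES (?, ?, ?)",
            ((row[ididx], row[nameidx], normalize(row[nameidx])) for row in reader),
        )


if __name__ == "__main__":
    data = io.StringIO("item\titemLabel\nQ1\tLake Tahoe\nQ2\tCrater Lake\n")
    conn = sqlite3.connect(":memory:")
    init_db(conn, data, "item", "itemLabel", delimiter="\t")
    print(conn.execute("SELECT * FROM reconcile").fetchall())
```

The primary key on `id` is what enforces the "distinct values" requirement on the id column.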

Note that the service supplies all its data with a dummy type so there is no reason to reconcile against any particular type.

In addition to reconciling against the name column, the service also functions as a data extension service, which offers any of the other columns of the CSV file.

Note that Dice coefficient scoring is agnostic to word ordering.
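To make the scoring concrete, here is a small sketch of a Dice coefficient over character bigrams. It is an illustration under assumptions, not the shipped scorer: taking bigrams word by word (which is what makes this version insensitive to word order) and scaling the result to 0-100 are choices made for the example, and the names `bigrams` and `dice_score` are hypothetical.

```python
def bigrams(text):
    """Return the set of character bigrams of each word in `text`."""
    grams = set()
    for word in text.lower().split():
        grams.update(word[i:i + 2] for i in range(len(word) - 1))
    return grams


def dice_score(left, right):
    """Dice coefficient 2*|A & B| / (|A| + |B|), scaled to 0-100."""
    a, b = bigrams(left), bigrams(right)
    if not a or not b:
        return 0.0
    return 200.0 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    print(dice_score("Lake Tahoe", "Tahoe Lake"))  # same bigram sets
    print(dice_score("night", "nacht"))            # one shared bigram, "ht"
```

Because the bigrams are collected into a set per value, swapping the words of a name leaves the set, and therefore the score, unchanged.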

Usage

Basic usage involves two steps:

  • initialization
  • running the service

Initialization primes the database with the data processed from the CSV file with the init subcommand. There are several options for running the service as described below.

Initialization

Basic usage of the init sub-command requires passing the name of the CSV file, the id column and the name column.

(venv) $ csv-reconcile --help
Usage: csv-reconcile [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  init
  run
  serve
(venv) $ csv-reconcile init --help
Usage: csv-reconcile init [OPTIONS] CSVFILE IDCOL NAMECOL

Options:
  --config TEXT  config file
  --scorer TEXT  scoring plugin to use
  --help         Show this message and exit.
(venv) $ poetry run csv-reconcile serve --help
Usage: csv-reconcile serve [OPTIONS]

Options:
  --help         Show this message and exit.
(venv) $

The --config option is used to point to a configuration file. The file is a Flask configuration and hence is Python code though most configuration is simply setting variables to constant values.
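For instance, a config file might look like the sketch below. The variable names come from the Common configuration list in this README; the values are only examples, and the `CSV_RECONCILE_DB` environment variable is a hypothetical name used to show that ordinary Python is allowed. Flask's file-based configuration only picks up upper-case names, so lower-case helpers such as `db_name` should not leak into the settings.

```python
# Example config file, e.g. passed as: csv-reconcile init --config my.cfg ...
import os

# Hypothetical: let an environment variable choose the database file.
db_name = os.environ.get("CSV_RECONCILE_DB", "lakes.db")

SERVER_NAME = "localhost:5555"                       # host:port of the service
CSVKWARGS = {"delimiter": "\t", "quotechar": '"'}    # passed to csv.reader
CSVENCODING = "utf-8-sig"                            # encoding of the input file
SCOREOPTIONS = {"stopwords": ["lake", "reservoir"]}  # passed to the scorer
LIMIT = 10                                           # max candidates per entry
THRESHOLD = 80.5                                     # minimum candidate score
DATABASE = db_name                                   # generated sqlite file
```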

Running the service

The simplest way to run the service is to use Flask’s built-in web server with the serve subcommand which takes no arguments. However, as mentioned in the Flask documentation, this server is not suitable for production purposes.

For a more hardened service, you can use one of the other deployment options mentioned in that documentation. For example, gunicorn can be run as follows:

(venv) $ gunicorn -w 4 'csv_reconcile:create_app()'
1-11-16 17:40:20 +0900] [84625] [INFO] Starting gunicorn 20.1.0
1-11-16 17:40:20 +0900] [84625] [INFO] Listening at: http://127.0.0.1:8000 (84625)
1-11-16 17:40:20 +0900] [84625] [INFO] Using worker: sync
1-11-16 17:40:20 +0900] [84626] [INFO] Booting worker with pid: 84626
1-11-16 17:40:20 +0900] [84627] [INFO] Booting worker with pid: 84627
1-11-16 17:40:20 +0900] [84628] [INFO] Booting worker with pid: 84628
1-11-16 17:40:20 +0900] [84629] [INFO] Booting worker with pid: 84629
...

One thing to watch out for is that the default manifest points the extension service to port 5000, the default port for the Flask built-in web server. If you want to use the extension service when deploying to a different port, you’ll want to be sure to override that part of the manifest in your config file. You’ll need something like the following:

MANIFEST = {
    "extend": {
        "propose_properties": {
            "service_url": "http://localhost:8000",
            "service_path": "/properties"
        }
    }
}

Note also that the configuration is saved during the init step. If you change the config, you’ll need to re-run that step. You may also need to delete and re-add the service in OpenRefine.

Deprecated

The run subcommand mimics the old behavior which combined the initialization step with the running of the service. This may be removed in a future release.

Common configuration

  • SERVER_NAME - The host and port the service is bound to. e.g. SERVER_NAME=localhost:5555. ( Default localhost:5000 )
  • CSVKWARGS - Arguments to pass to csv.reader. e.g. CSVKWARGS={'delimiter': ',', 'quotechar': '"'} for comma delimited files using " as quote character.
  • CSVENCODING - Encoding of the CSV file. e.g. CSVENCODING="utf-8-sig" is the encoding used for data downloaded from GNIS.
  • SCOREOPTIONS - Options passed to scoring plugin during normalization. e.g. SCOREOPTIONS={'stopwords':['lake','reservoir']}
  • LIMIT - The maximum number of reconciliation candidates returned per entry. ( Default 10 ) e.g. LIMIT=10
  • THRESHOLD - The minimum score for returned reconciliation candidates. ( Default 30.0 ) e.g. THRESHOLD=80.5
  • DATABASE - The name of the generated sqlite database containing pre-processed values. (Default csvreconcile.db) e.g. DATABASE='lakes.db' You may want to change the name of the database if you regularly switch between databases being used.
  • MANIFEST - Overrides for the service manifest. e.g. MANIFEST={"name": "My service"} sets the name of the service to “My service”.

This last is most interesting. If your data is coming from Wikidata and your id column contains Q values, then a manifest like the following will allow your links to be clickable inside OpenRefine.

MANIFEST = {
  "identifierSpace": "http://www.wikidata.org/entity/",
  "schemaSpace": "http://www.wikidata.org/prop/direct/",
  "view": {"url":"https://www.wikidata.org/wiki/{{id}}"},
  "name": "My reconciliation service"
}

If your CSV is made up of data taken from another reconciliation service, you may similarly copy parts of their manifest to make use of their features, such as the preview service. See the reconciliation spec for details.

Built-in preview service

There is a preview service built into the tool. (Thanks b2m!) You can turn it on by adding the following to your manifest:

"preview": {
   "url": "http://localhost:5000/preview/{{id}}",
   "width": 400,
   "height": 300
}
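Put in context, and assuming manifest overrides are supplied through the MANIFEST variable of the config file as described under Common configuration, the fragment would sit inside the dictionary roughly as follows. The lower-case `base_url` helper and the service name are illustrative only.

```python
# Config file sketch: enable the built-in preview service.
base_url = "http://localhost:5000"  # adjust if the service runs elsewhere

MANIFEST = {
    "name": "My reconciliation service",
    "preview": {
        # {{id}} is left as-is; it is a placeholder for the entity id.
        "url": base_url + "/preview/{{id}}",
        "width": 400,
        "height": 300,
    },
}
```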

Note that if you reconcile against a service with the preview service enabled, a link to the service becomes part of the project. Thus if you bring the service down, your project will have hover-over pop-ups pointing to an unavailable service. One way around this is to copy recon.match.id to a new column, which can be re-reconciled by id if you bring the service back up, whether or not you have the preview service enabled. (Perhaps OpenRefine could be smarter about enabling these pop-ups only when the service is active.)

Scoring plugins

As mentioned above the default scoring method is to use Dice coefficient scoring, but this method can be overridden by implementing a csv_reconcile.scorers plugin.

Implementing

A plugin module may override any of the methods in the csv_reconcile.scorers module by simply implementing a method of the same name with the decorator @csv_reconcile.scorer.register.

See csv_reconcile_dice for how Dice coefficient scoring is implemented.

The basic hooks are as follows:

  • normalizedWord(word, **scoreOptions) preprocesses values to be reconciled to produce a tuple used in fuzzy match scoring. The value of SCOREOPTIONS in the configuration will be passed in to allow configuration of this preprocessing. This hook is required.
  • normalizedRow(word, row, **scoreOptions) preprocesses values to be reconciled against to produce a tuple used in fuzzy match scoring. Note that both the reconciled column and the entire row are available for calculating the normalized value and that the column reconciled against is required even when not used. The value of SCOREOPTIONS in the configuration will be passed in to allow configuration of this preprocessing. This defaults to calling normalizedWord(word,**scoreOptions).
  • getNormalizedFields() returns a tuple of names for the columns produced by normalizedWord(). The length of the return value from both functions must match. This hook is required.
  • processScoreOptions(options) is passed the value of SCOREOPTIONS to allow it to be adjusted prior to being used. This can be used for adding defaults and/or validating the configuration. This hook is optional.
  • scoreMatch(left, right, **scoreOptions) gets passed two tuples as returned by normalizedWord(). The left value is the value being reconciled and the right value is the value being reconciled against. The value of SCOREOPTIONS in the configuration will be passed in to allow configuration of the scoring. Returning a score of None means the tested value is not added as a candidate. This hook is required.
  • valid(normalizedFields) is passed the normalized tuple prior to being scored to make sure it’s appropriate for the calculation. This hook is optional.
  • features(word, row, **scoreOptions) calculates features using the query string and the normalized row. By default calculating features is disabled. Implementations of this hook are automatically enabled. This hook is optional.
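To show how the hooks listed above fit together, here is a sketch of a toy scorer based on word overlap. It is written as plain functions so it runs on its own; in a real plugin each function would carry the register decorator mentioned earlier. The scoring rule itself, and the assumption that processScoreOptions adjusts the options dictionary in place, are choices made for this example rather than documented behavior.

```python
def processScoreOptions(options):
    """Add defaults and normalize the configured stopwords (assumed in place)."""
    options["stopwords"] = [w.lower() for w in options.get("stopwords", [])]


def normalizedWord(word, **scoreOptions):
    """Reduce a value to a one-field tuple of sorted, lower-cased words."""
    stopwords = set(scoreOptions.get("stopwords", []))
    tokens = sorted(t for t in word.lower().split() if t not in stopwords)
    return (" ".join(tokens),)


def getNormalizedFields():
    """One column name per element of the tuple from normalizedWord()."""
    return ("tokens",)


def valid(normalizedFields):
    """Skip values that normalized to nothing."""
    return bool(normalizedFields[0])


def scoreMatch(left, right, **scoreOptions):
    """Word-overlap (Jaccard) score from 0 to 100; None means no candidate."""
    a, b = set(left[0].split()), set(right[0].split())
    if not a & b:
        return None
    return 100.0 * len(a & b) / len(a | b)


if __name__ == "__main__":
    opts = {"stopwords": ["Lake"]}
    processScoreOptions(opts)
    query = normalizedWord("Lake Tahoe", **opts)
    candidate = normalizedWord("Tahoe", **opts)
    print(query, candidate, scoreMatch(query, candidate, **opts))
```

Note that getNormalizedFields() returns a one-element tuple to match the one-element tuple from normalizedWord(), as the list above requires.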

Installing

Hooks are automatically discovered as long as they provide a csv_reconcile.scorers setuptools entry point. Poetry supplies a plugins configuration which wraps the setuptools functionality.

The default Dice coefficient scoring is supplied via the following snippet from the pyproject.toml file.

[tool.poetry.plugins."csv_reconcile.scorers"]
"dice" = "csv_reconcile_dice"

Here dice becomes the name of the scoring option and csv_reconcile_dice is the package implementing the plugin.
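To see which scoring plugins are visible in the current environment, the entry point group can be inspected with the standard library. This is only a diagnostic sketch; it is not necessarily how csv-reconcile performs its own discovery, and `available_scorers` is a name invented for the example.

```python
from importlib.metadata import entry_points


def available_scorers(group="csv_reconcile.scorers"):
    """Map each registered scorer name to the module that implements it."""
    eps = entry_points()
    if hasattr(eps, "select"):
        found = eps.select(group=group)   # Python 3.10 and later
    else:
        found = eps.get(group, [])        # older dict-style return value
    return {ep.name: ep.value for ep in found}


if __name__ == "__main__":
    # With only the default plugin installed this would list the dice scorer.
    print(available_scorers())
```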

Using

If there is only one scoring plugin available, that plugin is used. If more than one is available, you will be prompted to pass the --scorer option to select among the scoring options.

Known plugins

See wiki for list of known plugins.

Testing

Though I long for the old days when a unit test was a unit test, these days things are a bit more complicated with various versions of Python and installation of plugins to manage. Now we have to wrestle with virtual environments. poetry handles the virtual environment for developing, but testing involves covering more options.

Tests layout

The tests directory structure is the following:

tests
    main
    plugins
        geo

Tests for the main package are found under main and don’t require installing any other packages whereas tests under plugins require the installation of the given plugin.

Running tests

Basic tests

These tests are written with pytest and can be run through poetry as follows:

$ poetry run pytest

To avoid the complications that come from installing plugins, there is a poe script for running only the tests under main which can be invoked as follows:

$ poe test

For steady state developing this is probably the command you’ll use most often.

Build matrices

The GitHub Actions for this project currently use a build matrix across a couple of architectures and several versions of Python, but a similar effect can be achieved using nox.

nox manages the creation of various virtual environments in what they call “sessions”, from which various commands can be run. This project’s noxfile.py manages the installation of the csv-reconcile-geo plugin for the plugin tests as well as running across several versions of Python. See the nox documentation for detail.

Some versions of this command you’re likely to run are as follows:

$ nox      # Run all the tests building virtual environments from scratch
$ nox -r   # Reuse previously built virtual environments for speed
$ nox -s test_geo  # Run only the tests for the csv-reconcile-geo plugin
$ nox -s test_main -p 3.8   # Run only the main tests with Python3.8

Eventually, the GitHub Actions may be changed to use setup-nox.

Future enhancements

It would be nice to add support for using properties as part of the scoring, so that more than one column of the csv could be taken into consideration.

csv-reconcile's Issues

Continuous integration?

I see there is a test suite already, so perhaps it would be worth running it in a continuous integration service?

As a side effect, this is a good way to document the install process on a stock machine (since you have to script it for the CI). I actually looked for the CI configuration files as a way to solve my install problems (#8).

ValueError: 'item' is not in list

I was able to set up and run csv-reconcile serve, but I cannot run the example on the reps.tsv file: I get ValueError: 'item' is not in list. Similarly, when I try the progressives.tsv file I get ValueError: 'itemLabel' is not in list. The errors are otherwise identical except for the last few lines. I have tried restarting everything and cannot get the init step to work before running the serve command. Any suggestions would be appreciated.

Last few lines of the error for reps.tsv:

  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 68, in init_db
    ididx = header.index(idcol)
ValueError: 'item' is not in list

Last few lines of the error for progressives.tsv:

  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 67, in init_db
    searchidx = header.index(searchcol)
ValueError: 'itemLabel' is not in list

The full error for reps.tsv:

(venv) C:\Users\jennifer.woodward\OneDrive - USDA\myGitOneDrive\NALT4MA\csv-reconcile>csv-reconcile init sample/reps.tsv item itemLabel
Traceback (most recent call last):
  File "C:\Program Files\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\jennifer.woodward\OneDrive - USDA\myGitOneDrive\NALT4MA\csv-reconcile\venv\Scripts\csv-reconcile.exe\__main__.py", line 7, in <module>
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 321, in main
    return cli()
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 271, in init
    return doinit(config, scorerOption, csvfile, idcol, namecol)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 259, in doinit
    initdb.init_db_with_context(csvfile, idcol, namecol)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 95, in init_db_with_context
    return init_db(db,
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 68, in init_db
    ididx = header.index(idcol)
ValueError: 'item' is not in list

When used under Windows with OpenRefine 3.6.1 the service fails - The endpoint MUST return a JSON document describing the service, accessible vîa CORS or JSONP.

When used under Windows with OpenRefine 3.6.1 the service fails.
If I try using the OpenRefine Test Bench tab I get the following error: "The endpoint MUST return a JSON document describing the service, accessible vîa CORS or JSONP."

The output from the service from the test is:

* Serving Flask app 'csv-reconcile' (lazy loading)
* Environment: production
  WARNING: This is a development server. Do not use it in a production deployment.
  Use a production WSGI server instead.
* Debug mode: off
* Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [07/Sep/2022 21:57:05] "OPTIONS / HTTP/1.1" 404 -
127.0.0.1 - - [07/Sep/2022 21:57:05] "GET / HTTP/1.1" 404 -
127.0.0.1 - - [07/Sep/2022 21:57:05] "OPTIONS /?callback=jsonp_1662580625915_97939 HTTP/1.1" 404 -
127.0.0.1 - - [07/Sep/2022 21:57:05] "GET /?callback=jsonp_1662580625915_97939 HTTP/1.1" 404 -

Might the OpenRefine protocol have changed?

CSV sniffer needs more data

Hello- I have a TSV file with line character counts as follows (the first line is the header)

155
130
656
416
707
950
526
753
186
731
...

csv-reconcile init gives me the following error:

$ poetry run csv-reconcile init test7.tsv col1_name col2_name

...
File "/home/me/src/csv-reconcile/csv_reconcile/initdb.py", line 88, in init_db
searchidx = header.index(searchcol)
^^^^^^^^^^^^^^^^^^^^^^^
ValueError: 'col2_name' is not in list

The error is fixed if I change the amount of data being fed to the sniffer on this line
dialect = csv.Sniffer().sniff(csvfile.read(10240))

where I changed the previous value of 1024 to be 10240.

Provide Preview Service

I am reconciling against local CSV files with ambiguous data in the search column and without ids from external systems (Wikidata...).

ID    Search Column    Additional Data
1     John Doe         1970-01-01
2     John Doe         2020-01-01
...   ...              ...

To identify the correct match from the Reconciliation API I have to manually check the proposed results against the data in the CSV files. To speed this process up I would prefer to use a Preview Service as defined in the Reconciliation Service API Specification.

Make preview service opt-in

Per the discussion in #28 make the preview service opt-in by simply pointing the manifest at it like so:

    "preview": {
       "url": "http://localhost:5000/preview/{{id}}",
       "width": 400,
       "height": 300
    }

Install instructions

I am a bit confused by the install instructions in the readme, which currently are (after cloning the repository):

$ python -m venv venv                                             # create virtualenv
$ venv/bin/pip install csv-reconcile                              # install package
$ source venv/bin/activate                                        # activate virtual environment
(venv) $ csv-reconcile --init-db sample/reps.tsv item itemLabel   # start the service
(venv) $ deactivate                                               # remove virtual environment

It seems to me that this installs csv-reconcile from PyPI (using the latest release) rather than from the code contained in the repository. What are the steps to run the code directly instead?

Generally speaking I expect the following workflow (but perhaps that's old-school!):

$ python -m venv venv
$ source venv/bin/activate     # activate the virtualenv before installing anything
$ pip install -r requirements.txt  # install dependencies
$ python setup.py install # install csv-reconcile itself
$ csv-reconcile …

Incorrect encoding detection

Hello,

at first, let me thank you for this great reconciliation tool!

I've been trying to use csv-reconcile with the csv-reconcile-geo plugin. I am not sure where the error comes from, so feel free to direct me elsewhere if the problem does not originate on your side.

The problem was that the "budovy_wdqs.tsv" file I was providing used UTF-8 encoding, while the program apparently expects cp-1250 for some reason. When I re-saved the .tsv in cp-1250, the bug went away.

(venv) C:\...\csv-reconcile [master ≡ +4 ~0 -0 !]> csv-reconcile --init-db budovy_wdqs.tsv item coords --scorer geo --config config.txt
Traceback (most recent call last):
  File "C:\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
    exec(code, run_globals)
  File "C:\...\csv-reconcile\venv\Scripts\csv-reconcile.exe\__main__.py", line 7, in <module>
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "C:\...\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 195, in main
    initdb.init_db_with_context()
  File "C:\...\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 90, in init_db_with_context
    return init_db(db,
  File "C:\...\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 58, in init_db
    header = next(reader)
  File "C:\Python310\lib\encodings\cp1250.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x88 in position 2094: character maps to <undefined>
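The traceback shows the file being decoded with the platform default (cp1250) rather than UTF-8. The README's CSVENCODING setting exists for this case. As a standalone illustration of the difference an explicit encoding makes, here is a sketch; the `read_header` helper is hypothetical and works on any binary stream.

```python
import csv
import io


def read_header(stream, encoding="utf-8", **csvkwargs):
    """Decode a binary stream with an explicit encoding and return the header row."""
    text = io.TextIOWrapper(stream, encoding=encoding, newline="")
    return next(csv.reader(text, **csvkwargs))


if __name__ == "__main__":
    raw = "item\tnázev\n".encode("utf-8")
    # Explicit UTF-8 decodes the header as written.
    print(read_header(io.BytesIO(raw), delimiter="\t"))
    # The same bytes read as cp1250 come out garbled, or fail on some bytes.
    print(read_header(io.BytesIO(raw), encoding="cp1250", delimiter="\t"))
```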

Problem with CSV dialect sniffing

I tested the new behaviour for CSV dialect sniffing introduced for #41 in #42 and discovered the following problems:

  1. The csv.Sniffer().sniff() method will throw a "Could not determine delimiter" error in case it is unsure about a delimiter.
  2. I can not overwrite this behaviour via the CSVKWARGS configuration because it is applied later.

dialect = csv.Sniffer().sniff(csvfile.read(1024))

The reason for csv.Sniffer() being unsure about the delimiter is that while reading a fixed chunk of the csv file this chunk might end in the middle of a csv line and therefore the number of delimiters in this line is off.

Working example with whole file:

import csv
csv.Sniffer().sniff("a,b,c\n1,2,3")

Problematic example with only part of the file (throws error):

import csv
csv.Sniffer().sniff("a,b,c\n1,2")

So I would recommend using dialect sniffing only (or additionally?) when the user has not given explicit instructions on the dialect via CSVKWARGS, and using csvfile.readline() to avoid having a line cut somewhere.
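The recommendation can be sketched as follows: explicit CSVKWARGS win, and otherwise the sniffer is fed a few complete lines. This is one possible shape for the fix rather than the code that was merged; `open_reader` and its `sample_lines` parameter are names chosen for the example.

```python
import csv
import io


def open_reader(csvfile, csvkwargs=None, sample_lines=2):
    """Return a csv.reader, sniffing the dialect only when none was given."""
    if csvkwargs:
        # The user was explicit, so do not second-guess them.
        return csv.reader(csvfile, **csvkwargs)
    # Sample whole lines so the last sampled line is never cut short.
    sample = "".join(csvfile.readline() for _ in range(sample_lines))
    csvfile.seek(0)
    dialect = csv.Sniffer().sniff(sample)
    return csv.reader(csvfile, dialect)


if __name__ == "__main__":
    print(list(open_reader(io.StringIO("a,b,c\n1,2,3\n4,5,6\n"))))
    print(list(open_reader(io.StringIO("a;b\n1;2\n"), {"delimiter": ";"})))
```

Sniffing can still fail on unusual files, so a caller may want to catch csv.Error and fall back to the default dialect.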

Setting up csv-reconcile-geo

FWIW, if you don't mind running your own reconciliation service, I've just written a geo scoring plugin for csv-reconcile.

With this you could, say run a SPARQL query to find coordinate locations of points you're looking to match against, export that as a TSV file and use that to run csv-reconcile.

You can get the service up and running as simply as the following:

$ python -m venv serverenv
$ source serverenv/bin/activate
$ python -m pip install csv-reconcile
$ python -m pip install csv-reconcile-geo
$ csv-reconcile --init-db query.tsv item coord --scorer geo 

Here item is the name of the column containing the QID's and coord is the name of the coordinate column in well-known text format, the default export format for coordinates.

This was just my first pass at it. There's certainly room for improvement, but it may suit your immediate needs.

@gitonthescene Please could you assist me with this? I am a bit disoriented and I am not sure if I understand the overall idea of 'my own' reconciliation service correctly. Am I right in assuming that I need to load File number 1 into openrefine, load File number 2 into command line via the commands above, add a reconciliation service "http://127.0.0.1:5000/reconcile" to OpenRefine and reconcile?

I think I was able to start virtualenv on my system (I am on Windows and "source" did not work, but I think I was able to find a solution at https://stackoverflow.com/questions/8921188/issue-with-virtualenv-cannot-activate) and then I was able to install csv-reconcile and csv-reconcile-geo. However, this is what I get when I run the program:

(venv) C:\Users\vojte\Downloads>csv-reconcile --init-db query.tsv item coord --scorer geo
c:\users\vojte\venv\lib\site-packages\normality\__init__.py:72: ICUWarning: Install 'pyicu' for better text transliteration.
  text = ascii_text(text)
Traceback (most recent call last):
  File "C:\Users\vojte\AppData\Local\Programs\Python\Python37-32\Lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\vojte\AppData\Local\Programs\Python\Python37-32\Lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\vojte\venv\Scripts\csv-reconcile.exe\__main__.py", line 7, in <module>
  File "c:\users\vojte\venv\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\vojte\venv\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "c:\users\vojte\venv\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\vojte\venv\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "c:\users\vojte\venv\lib\site-packages\csv_reconcile\__init__.py", line 210, in main
    initdb.init_db()
  File "c:\users\vojte\venv\lib\site-packages\csv_reconcile\initdb.py", line 76, in init_db
    (mid, word) + tuple(matchFields))
sqlite3.IntegrityError: UNIQUE constraint failed: reconcile.id

My query.tsv is from https://w.wiki/3BV9

What do you think is happening? Sorry to spam the issue with my questions

Originally posted by @VojtechDostal in wetneb/openrefine-wikibase#101 (comment)
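The `UNIQUE constraint failed: reconcile.id` error is consistent with the id column containing repeated values, which the Description says must be distinct. A quick check along the following lines can list the repeated ids before running the initialization; `duplicate_ids` is a hypothetical helper, not part of the tool.

```python
import csv
import io
from collections import Counter


def duplicate_ids(csvfile, idcol, **csvkwargs):
    """Return {id: count} for every id value that appears more than once."""
    reader = csv.DictReader(csvfile, **csvkwargs)
    counts = Counter(row[idcol] for row in reader)
    return {value: n for value, n in counts.items() if n > 1}


if __name__ == "__main__":
    sample = io.StringIO(
        "item\tcoord\n"
        "Q1\tPoint(1 2)\n"
        "Q2\tPoint(3 4)\n"
        "Q1\tPoint(5 6)\n"
    )
    print(duplicate_ids(sample, "item", delimiter="\t"))
```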

Handle mismatched lines in csv files

We should expect all lines in the csv file to have the same number of entries. Skip lines which have a different number. Generally these will be trailing blank lines created by whatever generated the csv file.
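A minimal sketch of the proposed behavior, assuming the header row defines the expected number of entries; the generator name `consistent_rows` is made up for illustration.

```python
import csv
import io


def consistent_rows(reader):
    """Yield the header, then only rows with as many entries as the header."""
    header = next(reader)
    yield header
    for row in reader:
        if len(row) == len(header):
            yield row
        # Rows of any other length (blank lines included) are skipped.


if __name__ == "__main__":
    data = io.StringIO("id,name\n1,a\n\n2,b,extra\n3,c\n")
    print(list(consistent_rows(csv.reader(data))))
```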

Thank You!!

Please do feel free to close this :) I just wanted to drop a note about how pleasantly easy this was to use to join a few columns into my tables - thanks for the clear docs and easy install!!

Extending columns showing id and not name

When using "Add columns from reconciled values", you see the list of db column names to choose from and not the original csv column names. Using the original csv column names looks cleaner.

localhost:5000/reconcile not displaying properly

After executing the following commands:

Screen Shot 2022-05-03 at 2 23 26 PM

http://127.0.0.1:5000/reconcile displays as

Screen Shot 2022-05-03 at 2 24 29 PM

The following commands did not result in an error message and all executed successfully. What could the issue be?

I don't believe it's an issue described in #41. To test, I created a config file adding a line
"CSVENCODING = " with csvencoding as the encoding for the file and the localhost:5000/reconcile did not change.

Any suggestions would be greatly appreciated! Thanks!

Csv-reconcile geo does not suggest some close objects after OpenRefine reconciliation

I have the following object:

Point(14.6152142 50.0812828)

I am reconciling it to the outcome of this query:

https://w.wiki/3CDn

The correct match is Q64816168
However that suggestion does not appear in the top-ten. Any ideas why? Do I miss something obvious?

Steps to replicate:

  1. insert new project from clipboard with only this content:
    Point(14.6152142 50.0812828)
  2. start a reconciliation service as I just documented in #3 , using query.tsv from the query above
  3. reconcile the column using that service - top hit is Q64816166, which is close, but farther than Q64816168

The message after activating the reconciliation service in the terminal from csv-reconcile has incorrect service url.

While trying to run a reconciliation service from a CSV file with csv-reconcile in the terminal, the startup message reads "Running on http://127.0.0.1:5000/". That URL gives an error when we click it or when we try adding it to OpenRefine as a service; the actual service URL is the one given in the csv-reconcile instructions. Clicking the actual service URL leads to the service manifest, and it can be added to OpenRefine as a service successfully.
The URL in the message should redirect to the service URL (http://127.0.0.1:5000/reconcile), which would be very convenient for users.

Reorganize command line options to accommodate running in a container

When running from a container, we should initialize the database when creating the container and simply run the service when running the container. The current command line interface obscures this distinction. Deprecate current syntax for something more accommodating. Namely, use sub-commands to "init" the database and "serve" the data. Keep a deprecated "run" sub-command which mimics the current behavior.
