
csvdedupe's People

Contributors

aboutaaron, ajschumacher, daveklassen, derekeder, fgregg, joernhees, melvin15may, reginafcompton, reidab, verglor


csvdedupe's Issues

Ability to pass in config_file

In some cases, the configuration will be more complicated than is practical to enter on the command line, so we should allow the user to provide a config_file instead.

dedupe_csv.py input.csv output.csv --config_file configuration_file
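A minimal sketch of what such a file might contain, borrowing the JSON shape that appears in later issues on this page (the exact schema is the subject of the "Determine config_file format" issue below):

{
  "field_names": ["Site name", "Address", "Zip", "Phone"],
  "output_file": "output.csv",
  "skip_training": false,
  "training_file": "training.json",
  "sample_size": 150000,
  "recall_weight": 2
}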

Split off from #1

Left and right join options for csvlink

We already have --inner_join, but it would be useful to have --left_join and --right_join as well, for when we want to return records in one file that were not matched to the other.
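To make the semantics concrete, here is a rough sketch of what --left_join output could mean (left_join is a hypothetical helper, not csvlink's actual code): every row of file 1 appears exactly once, padded with blanks when it has no match.

def left_join(rows_1, rows_2, matched_pairs, n_cols_2):
    # matched_pairs: (index into rows_1, index into rows_2) for each match
    pair_lookup = dict(matched_pairs)
    for i, row in enumerate(rows_1):
        j = pair_lookup.get(i)
        # matched rows carry their file-2 columns; unmatched rows get blanks
        yield row + (rows_2[j] if j is not None else [''] * n_cols_2)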

Cannot allocate memory

I'm trying to use csvdedupe against a dataset of about 800k records. After blocking is complete I encounter an OSError: Cannot allocate memory. This is using Python 3.4 on an Ubuntu AWS instance with 7.5 GB of memory.

INFO:dedupe.blocking:820000, 260.8973972 seconds
INFO:dedupe.api:0 blocks
Traceback (most recent call last):
  File "/home/ubuntu/.virtualenvs/opa_parse/bin/csvdedupe", line 11, in <module>
    sys.exit(launch_new_instance())
  File "/home/ubuntu/.virtualenvs/opa_parse/lib/python3.4/site-packages/csvdedupe/csvdedupe.py", line 162, in launch_new_instance
    d.main()
  File "/home/ubuntu/.virtualenvs/opa_parse/lib/python3.4/site-packages/csvdedupe/csvdedupe.py", line 134, in main
    threshold = deduper.threshold(data_d, recall_weight=self.recall_weight)
  File "/home/ubuntu/.virtualenvs/opa_parse/lib/python3.4/site-packages/dedupe/api.py", line 204, in threshold
    return self.thresholdBlocks(blocked_pairs, recall_weight)
  File "/home/ubuntu/.virtualenvs/opa_parse/lib/python3.4/site-packages/dedupe/api.py", line 70, in thresholdBlocks
    self.num_cores)['score']
  File "/home/ubuntu/.virtualenvs/opa_parse/lib/python3.4/site-packages/dedupe/core.py", line 220, in scoreDuplicates
    [process.start() for process in map_processes]
  File "/home/ubuntu/.virtualenvs/opa_parse/lib/python3.4/site-packages/dedupe/core.py", line 220, in <listcomp>
    [process.start() for process in map_processes]
  File "/usr/lib/python3.4/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.4/multiprocessing/context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.4/multiprocessing/context.py", line 267, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.4/multiprocessing/popen_fork.py", line 21, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.4/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

When I run it with Python 2.7.6 it runs significantly slower, gets through about 670,000 rows and throws Segmentation fault (core dumped).

Is this just a problem with the specs of my machine? Is 7.5 GB just not enough memory for a 93 MB, 800k row file?

Thanks in advance for any help!

print() and input() statements not echoing to console

When I run csvlink, I get the following output:

INFO:root:Generating grammar tables from /usr/lib/python2.7/lib2to3/Grammar.txt
INFO:root:Generating grammar tables from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
INFO:root:imported 9568 rows from file 1
INFO:root:imported 6790 rows from file 2
INFO:root:using fields: ['title', 'year', 'duration', 'director']
INFO:root:taking a sample of 1500 possible pairs
INFO:root:starting active labeling...
INFO:dedupe.api:Learned Weights
INFO:dedupe.api:('(title: String)', 8.127357362229678e-05)
INFO:dedupe.api:('(year: String)', 8.127357362229678e-05)
INFO:dedupe.api:('(duration: String)', 8.127357362229678e-05)
INFO:dedupe.api:('(director: String)', -2.464518083101114e-05)
INFO:dedupe.api:('bias', 16.064579514990712)
INFO:dedupe.training:1.0

and the program is waiting for input at convenience.py:42. I think stdout is getting swallowed somewhere, but I can't track it down. I can set a breakpoint in there and log additional output without a problem, but my print statements get swallowed.

csvlink command

Now that we have some data matching code in dedupe, it would be great to implement a new csvlink command:

csvlink --fields_1 "PIPE DESCRIPTION" --fields_2 "LOCATION" ssma.csv mwrd.csv

This would return a CSV that links together records from the different sources within a cluster. The user would still need to delete bad matches.

Option to use previous settings file?

As I learn more about this set of scripts, I discovered there is quite a similar script to csvdedupe.py here:

https://github.com/datamade/dedupe-examples/blob/master/csv_example/csv_example.py

Looking at this script, I like one specific ability not present in the library version: it saves the dedupe settings file from a run and reuses it in consecutive runs. Essentially I want results that are consistent (and to be able to judge one run as more successful than another). Was this left out of this particular script deliberately, or simply because it's not required?
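For reference, the pattern in csv_example.py looks roughly like this (a sketch from memory; settings_file and fields are placeholders):

import os
import dedupe

settings_file = 'learned_settings'
fields = [{'field': 'Name', 'type': 'String'}]   # placeholder field definition

if os.path.exists(settings_file):
    # reuse the settings learned on a previous run, for consistent results
    with open(settings_file, 'rb') as f:
        deduper = dedupe.StaticDedupe(f)
else:
    deduper = dedupe.Dedupe(fields)
    # ... sample, label, and train as usual, then persist the result:
    with open(settings_file, 'wb') as f:
        deduper.writeSettings(f)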

skip_training configuration value overwritten

I was setting a value for skip_training in the configuration file, but this value is always overridden by the command-line settings. The command-line settings are merged on top of the already-loaded configuration-file settings, and when this happens skip_training is always set to false, even when the flag is not supplied on the command line. A sketch of the problem follows.
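A hedged reconstruction of the bug (csvdedupe's actual option handling may differ): argparse fills every unset flag with its default (False), and a blind update() then lets that default clobber the config file. Using None as the argparse default and filtering it out avoids this.

import argparse

config = {'skip_training': True}          # value loaded from the config file

parser = argparse.ArgumentParser()
# default=None (not False) so "flag absent" is distinguishable from "flag off"
parser.add_argument('--skip_training', action='store_const', const=True,
                    default=None)
args = parser.parse_args([])              # simulate: no flags on the command line

cli_settings = {k: v for k, v in vars(args).items() if v is not None}
config.update(cli_settings)               # only explicitly-given flags override
print(config['skip_training'])            # True, as set in the config file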

Having some UnicodeDecodeError errors using csvdedupe

Here's the traceback for the UnicodeDecodeError I was running into.

UnicodeDecodeError

Traceback (most recent call last):
  File "/Users/johria/.pyenv/versions/3.5.0/bin/csvdedupe", line 11, in <module>
    sys.exit(launch_new_instance())
  File "/Users/johria/.pyenv/versions/3.5.0/lib/python3.5/site-packages/csvdedupe/csvdedupe.py", line 161, in launch_new_instance
    d = CSVDedupe()
  File "/Users/johria/.pyenv/versions/3.5.0/lib/python3.5/site-packages/csvdedupe/csvdedupe.py", line 33, in __init__
    self.input = open(self.configuration['input'], 'rU').read()
  File "/Users/johria/.pyenv/versions/3.5.0/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 546689: invalid continuation byte

In case it means anything, when I opened the file with Excel I first saw this error message:

Alert
Excel has detected that 'DC_contribs_since_2007.csv' is a SYLK file, but cannot load it. Either the file has errors or it is not a SYLK file format. Click OK to try to open the file in a different format.

After opening the file in Sublime and doing File -> Save with Encoding -> UTF-8, everything worked.

Not sure if you have any ideas regarding what happened, but thought I'd drop an issue!

The file I was working with is this one: https://github.com/codefordc/dc-campaign-finance-watch/blob/develop/DC_contribs_since_2007.csv
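A possible programmatic equivalent of the Sublime fix, assuming you just want to force the file into valid UTF-8 (undecodable bytes are replaced, since the original encoding is unknown):

# read with invalid bytes replaced, then write back out as clean UTF-8
with open('DC_contribs_since_2007.csv', encoding='utf-8', errors='replace') as src:
    text = src.read()
with open('DC_contribs_utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(text)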

Screen refresh flash using curses

The screen refreshes for each comparison. I believe this is due to the way we pass in one uncertain_pair at a time, which kicks off a new curses.wrapper each time.

TypeError: expected string or buffer

$ csvlink all-hotels.csv gta-new-hotels.csv --config_file=csvdedupe-config.json

INFO:root:Generating grammar tables from /usr/lib/python2.7/lib2to3/Grammar.txt
INFO:root:Generating grammar tables from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
Traceback (most recent call last):
  File "/usr/local/bin/csvlink", line 11, in <module>
    sys.exit(launch_new_instance())
  File "/usr/local/lib/python2.7/dist-packages/csvdedupe/csvlink.py", line 169, in launch_new_instance
    d.main()
  File "/usr/local/lib/python2.7/dist-packages/csvdedupe/csvlink.py", line 74, in main
    prefix='input_1')
  File "/usr/local/lib/python2.7/dist-packages/csvdedupe/csvhelpers.py", line 52, in readData
    clean_row = {k: preProcess(v) for (k, v) in row.items()}
  File "/usr/local/lib/python2.7/dist-packages/csvdedupe/csvhelpers.py", line 52, in <dictcomp>
    clean_row = {k: preProcess(v) for (k, v) in row.items()}
  File "/usr/local/lib/python2.7/dist-packages/csvdedupe/csvhelpers.py", line 29, in preProcess
    column = re.sub('  +', ' ', column)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

30k rows causes KeyError; core.randomPairs generates invalid keys

For a large dataset, a KeyError is raised at line 46 of Labeler.py. The reason appears to be that the core.randomPairs call at line 41 of Labeler.py generates keys that are out of bounds for the unique dataset. The dataset I was using had no ID set and was 30k rows.

I modified randomPairs to generate random pairs using numpy.random.random_integers(0, lengthOfData, sample_size) for each of the items in the returned tuple; this worked.

Can I specify the CSV delimiter?

It would be an interesting addition if we could specify the CSV delimiter as a command line parameter.

csvdedupe <configs> --delimiter ";" input_dot_comma.csv
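Supporting this would presumably just mean threading the option through to the CSV reader; a minimal sketch (read_rows is a hypothetical helper, not csvdedupe's actual code):

import csv

def read_rows(path, delimiter=','):
    with open(path, newline='') as f:
        # csv.DictReader already accepts any single-character delimiter
        return list(csv.DictReader(f, delimiter=delimiter))

rows = read_rows('input_dot_comma.csv', delimiter=';')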

Determine config_file format

Before we implement #3, figure out how we'd like to format our configuration.

Some options:

  • JSON format
  • Python list/dictionary

TypeError: expected string or buffer

I'm getting the following errors when trying to run csvdedupe with my own dataset. After following dedupe examples, I think it has to do with the preProcess function.

Traceback (most recent call last):
  File "/usr/app/anaconda/bin/csvdedupe", line 11, in <module>
    sys.exit(launch_new_instance())
  File "/usr/app/anaconda/lib/python2.7/site-packages/csvdedupe/csvdedupe.py", line 148, in launch_new_instance
    d.main()
  File "/usr/app/anaconda/lib/python2.7/site-packages/csvdedupe/csvdedupe.py", line 69, in main
    data_d = csvhelpers.readData(self.input, self.field_names)
  File "/usr/app/anaconda/lib/python2.7/site-packages/csvdedupe/csvhelpers.py", line 52, in readData
    clean_row = {k: preProcess(v) for (k, v) in row.items()}
  File "/usr/app/anaconda/lib/python2.7/site-packages/csvdedupe/csvhelpers.py", line 52, in <dictcomp>
    clean_row = {k: preProcess(v) for (k, v) in row.items()}
  File "/usr/app/anaconda/lib/python2.7/site-packages/csvdedupe/csvhelpers.py", line 29, in preProcess
    column = re.sub('  +', ' ', column)
  File "/usr/app/anaconda/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

Checking whether the column is None and casting it to a string first seems to solve the issue; a sketch of that guard is below.
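A sketch of the guarded function, based on the preProcess shown in the traceback (the surrounding cleanup lines follow the dedupe examples' preProcess and are assumptions about the rest of the function):

import re

def preProcess(column):
    # guard: csv.DictReader can yield None for missing trailing columns
    if column is None:
        return ''
    column = str(column)
    column = re.sub('  +', ' ', column)
    column = re.sub('\n', ' ', column)
    return column.strip().strip('"').strip("'").lower()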

Thanks in advance!

[Errno 2] No such file or directory

Hi there, I keep getting this error across different environments (PyCharm, bash) and with both dedupe and csvlink. I'm using Python 2.7 -- the two files I've tried to link are small (<2K rows, <120 KB). I ran nosetests -- they all passed. I was running OS X El Capitan and just upgraded to Sierra. Thanks for your help -- let me know if I can provide any additional information!

/Users/AUGUSTUS/.virtualenvs/musky/lib/python2.7/site-packages/dedupe/backport.py:17: UserWarning: NumPy linked against 'Accelerate.framework'. Multiprocessing will be disabled. http://mail.scipy.org/pipermail/numpy-discussion/2012-August/063589.html
  warnings.warn("NumPy linked against 'Accelerate.framework'. "
INFO:dedupe.api:100 records
INFO:dedupe.api:200 records
INFO:dedupe.api:300 records
INFO:dedupe.api:400 records
INFO:dedupe.api:500 records
INFO:dedupe.api:600 records
INFO:dedupe.api:700 records
INFO:dedupe.api:800 records
INFO:dedupe.api:900 records
Traceback (most recent call last):
  File "/Users/AUGUSTUS/.virtualenvs/musky/bin/csvlink", line 11, in <module>
    sys.exit(launch_new_instance())
  File "/Users/AUGUSTUS/.virtualenvs/musky/lib/python2.7/site-packages/csvdedupe/csvlink.py", line 208, in launch_new_instance
    d.main()
  File "/Users/AUGUSTUS/.virtualenvs/musky/lib/python2.7/site-packages/csvdedupe/csvlink.py", line 153, in main
    recall_weight=self.recall_weight)
  File "/Users/AUGUSTUS/.virtualenvs/musky/lib/python2.7/site-packages/dedupe/api.py", line 378, in threshold
    return self.thresholdBlocks(blocked_pairs, recall_weight)
  File "/Users/AUGUSTUS/.virtualenvs/musky/lib/python2.7/site-packages/dedupe/api.py", line 71, in thresholdBlocks
    self.num_cores)['score']
  File "/Users/AUGUSTUS/.virtualenvs/musky/lib/python2.7/site-packages/dedupe/core.py", line 212, in scoreDuplicates
    fillQueue(record_pairs_queue, records, n_map_processes)
  File "/Users/AUGUSTUS/.virtualenvs/musky/lib/python2.7/site-packages/dedupe/core.py", line 245, in fillQueue
    chunk = list(itertools.islice(iterable, int(chunk_size)))
  File "/Users/AUGUSTUS/.virtualenvs/musky/lib/python2.7/site-packages/dedupe/api.py", line 394, in <genexpr>
    pairs = (product(base, target) for base, target in blocks)
  File "/Users/AUGUSTUS/.virtualenvs/musky/lib/python2.7/site-packages/dedupe/api.py", line 438, in _blockData
    os.remove(file_path)
OSError: [Errno 2] No such file or directory: '/var/folders/6l/k5yl32r92qz1v6jfwljhbcjw0000gn/T/tmpwapJz4'

Output strings with commas not getting quoted

As I was working through the lobbyists vs contracts example, I think I may have stumbled upon a genuine bug. To reproduce, do something like I did over here (I added the example files to the repo) and then take a look at the output. You'll notice things like this showing up:

544,lobbyists,Big Chicago, Inc.,3000 W IRVING PARK RD,,Chicago,IL,60618, ,
151,lobbyists,AECOM USA, Inc.,303 E. Wacker Sr., #600,,Chicago,IL,60601, ,

where on the way in (from the abbreviated file I made with csvcut) they looked like:

$ csvgrep -c "CLIENT NAME" -m "Big Chicago" lobbyists.csv
CLIENT NAME,CLIENT ADDRESS,CLIENT ADDRESS 2,CLIENT CITY,CLIENT STATE,CLIENT ZIP,Award Amount
"Big Chicago, Inc.",3000 W IRVING PARK RD,,Chicago,IL,60618,
$ csvgrep -c "CLIENT NAME" -m "AECOM USA" lobbyists.csv
CLIENT NAME,CLIENT ADDRESS,CLIENT ADDRESS 2,CLIENT CITY,CLIENT STATE,CLIENT ZIP,Award Amount
"AECOM USA, Inc.","303 E. Wacker Sr., #600",,Chicago,IL,60601,

I can take a look at fixing this in the morning, if you like.
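If the output rows are being assembled by joining strings with commas, the fix is probably just to hand rows to csv.writer, which quotes any field containing the delimiter; a minimal demonstration:

import csv
import sys

writer = csv.writer(sys.stdout, quoting=csv.QUOTE_MINIMAL)
# "Big Chicago, Inc." comes back out quoted, so the row keeps its column count
writer.writerow([544, 'lobbyists', 'Big Chicago, Inc.',
                 '3000 W IRVING PARK RD', '', 'Chicago', 'IL', '60618', ' ', ''])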

IndexError: index -1 is out of bounds for axis 0 with size 0

Files for reproduction: dedupe.zip

$ docker run -it --rm -v $PWD:/data -w /data samirfor/csvdedupe-docker --config_file config.json ticmix.csv

INFO:root:imported 12 rows
INFO:root:using fields: ['titulo', 'data', 'hora', 'cidade_estado']
INFO:root:taking a sample of 150000 possible pairs
/usr/local/lib/python3.5/site-packages/dedupe/sampling.py:39: UserWarning: 75000 blocked samples were requested, but only able to sample 55
  % (sample_size, len(blocked_sample)))
INFO:root:reading labeled examples from training.json
INFO:dedupe.api:reading training from file
INFO:root:starting active labeling...

INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (fingerprint, titulo), TfidfNGramCanopyPredicate: (0.2, data))
INFO:root:caching training result set to file learned_settings
INFO:root:blocking...
INFO:root:finding a good threshold with a recall_weight of 2
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.2, data)
Traceback (most recent call last):
  File "/usr/local/bin/csvdedupe", line 9, in <module>
    load_entry_point('csvdedupe==0.1.14', 'console_scripts', 'csvdedupe')()
  File "/usr/local/lib/python3.5/site-packages/csvdedupe-0.1.14-py3.5.egg/csvdedupe/csvdedupe.py", line 180, in launch_new_instance
  File "/usr/local/lib/python3.5/site-packages/csvdedupe-0.1.14-py3.5.egg/csvdedupe/csvdedupe.py", line 127, in main
  File "/usr/local/lib/python3.5/site-packages/dedupe/api.py", line 235, in threshold
    return self.thresholdBlocks(blocked_pairs, recall_weight)
  File "/usr/local/lib/python3.5/site-packages/dedupe/api.py", line 76, in thresholdBlocks
    recall = expected_dupes / expected_dupes[-1]
IndexError: index -1 is out of bounds for axis 0 with size 0
/data # pip3 list
affinegap (1.10)
BTrees (4.3.2)
canonicalize (1.3)
categorical-distance (1.9)
csvdedupe (0.1.14)
dedupe (1.6.0)
dedupe-hcluster (0.3.2)
DoubleMetaphone (0.1)
fastcluster (1.1.22)
future (0.16.0)
haversine (0.4.5)
highered (0.2.1)
Levenshtein-search (1.4.2)
nose (1.3.7)
numpy (1.11.3)
persistent (4.2.2)
pexpect (4.2.1)
pip (9.0.1)
ptyprocess (0.5.1)
pyhacrf-datamade (0.2.0)
PyLBFGS (0.2.0.3)
rlr (2.4)
setuptools (20.10.1)
simplecosine (1.1)
simplejson (3.10.0)
six (1.10.0)
zope.index (4.2.0)
zope.interface (4.3.3)
/data # python
Python 3.5.2 (default, Dec 27 2016, 21:33:11) 
[GCC 5.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxsize
9223372036854775807

improve training labeling UI

The plan: use curses and getch, falling back to consoleLabel on platforms that don't support them (Windows); see the sketch after this list.

  • side by side comparison
  • use red/green colors for diffs
  • highlight exact matches
  • highlight missing fields
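A sketch of the fallback selection (choose_labeler is a hypothetical helper; dedupe's existing console labeler and the new curses labeler would be passed in):

try:
    import curses                    # not available on stock Windows Python
    HAS_CURSES = True
except ImportError:
    HAS_CURSES = False

def choose_labeler(curses_label, console_label):
    # prefer the richer curses UI; degrade gracefully on platforms without it
    return curses_label if HAS_CURSES else console_label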

Unhashable type error when destructive option is enabled

I can't get my head around why this is happening. The write functions in csvhelpers.py appear to be similar, but when I want to output just the unique records, I get an 'unhashable type: numpy.ndarray' error.

Traceback (most recent call last):
  File "c:\python34\lib\runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\python34\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Python34\Scripts\csvdedupe.exe\__main__.py", line 9, in <module>
  File "c:\python34\lib\site-packages\csvdedupe\csvdedupe.py", line 148, in launch_new_instance
    d.main()
  File "c:\python34\lib\site-packages\csvdedupe\csvdedupe.py", line 137, in main
    write_function(clustered_dupes, self.input, output_file)
  File "c:\python34\lib\site-packages\csvdedupe\csvhelpers.py", line 107, in writeUniqueResults
    cluster_membership[record_id] = cluster_id
TypeError: unhashable type: 'numpy.ndarray'

Anyone else face this issue? Any workarounds? Thanks

Option to specify fields by number

Some CSV files don't have headers, and it's somewhat awkward to add them just to use csvlink. So something like:

$ csvlink file1.csv file2.csv --field_numbers_1 3,4 --field_numbers_2 1,2

It would also be great to have single-letter argument forms to make everything more concise. The long argument names should probably use hyphens instead of underscores, as in other Unix utilities. A sketch of the field-number resolution is below.
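Implementation-wise this could be as simple as resolving the numbers against the first row (resolve_field_numbers is a hypothetical helper):

def resolve_field_numbers(header, numbers):
    # numbers are 1-based, as in cut(1); map them onto column names
    return [header[n - 1] for n in numbers]

header = ['title', 'year', 'duration', 'director']
print(resolve_field_numbers(header, [3, 4]))   # ['duration', 'director']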

README edits

  • Needs some explanatory text to frame commands at beginning.
  • Broken link at bottom.
  • Add these:
    Team
    Errors and Bugs
    Patches and Pull Requests
    Copyright and Attribution
  • change "csvdedupe usage" to "Getting Started"

Should csvdedupe support unicode?

We can have affine gap handle unicode characters, but it's not clear to me that we should.

Basically, it comes down to choosing between

distance("Tomas", "Tomás") = distance("Tomas", "Thomas")

or

distance("Thomas", "Thomás") = distance("Thomas", "Thomzs")

Problems using csvdupe and running nosetests

I have come across this issue in many attempts to use the script. In the most recent case I cloned this repo (and dedupe) and followed the "Testing" instructions in the README files. Completing the tests for dedupe did not surface any issues; however, testing csvdedupe produced this output:

[user@localhost csvdedupe]# nosetests
....E.....
======================================================================
ERROR: test_no_training (test_command_line.TestCSVDedupe)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.4/site-packages/pexpect/spawnbase.py", line 144, in read_nonblocking
    s = os.read(self.child_fd, size)
OSError: [Errno 5] Input/output error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.4/site-packages/pexpect/expect.py", line 97, in expect_loop
    incoming = spawn.read_nonblocking(spawn.maxread, timeout)
  File "/usr/lib/python3.4/site-packages/pexpect/pty_spawn.py", line 455, in read_nonblocking
    return super(spawn, self).read_nonblocking(size)
  File "/usr/lib/python3.4/site-packages/pexpect/spawnbase.py", line 149, in read_nonblocking
    raise EOF('End Of File (EOF). Exception style platform.')
pexpect.exceptions.EOF: End Of File (EOF). Exception style platform.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/src/csvdedupe/tests/test_command_line.py", line 26, in test_no_training
    child.expect("error: You need to provide an existing training_file or run this script without --skip_training")
  File "/usr/lib/python3.4/site-packages/pexpect/spawnbase.py", line 315, in expect
    timeout, searchwindowsize, async)
  File "/usr/lib/python3.4/site-packages/pexpect/spawnbase.py", line 339, in expect_list
    return exp.expect_loop(timeout)
  File "/usr/lib/python3.4/site-packages/pexpect/expect.py", line 102, in expect_loop
    return self.eof(e)
  File "/usr/lib/python3.4/site-packages/pexpect/expect.py", line 49, in eof
    raise EOF(msg)
pexpect.exceptions.EOF: End Of File (EOF). Exception style platform.
<pexpect.pty_spawn.spawn object at 0x7f4d16efa860>
command: /bin/csvdedupe
args: ['/bin/csvdedupe', 'examples/csv_example_messy_input.csv', '--field_names', 'Site name', 'Address', 'Zip', 'Phone', '--training_file', 'foo.json', '--skip_training']
searcher: None
buffer (last 100 chars): b''
before (last 100 chars): b'ltiprocessing/pool.py", line 599, in get\r\n    raise self._value\r\nZeroDivisionError: float division\r\n'
after: <class 'pexpect.exceptions.EOF'>
match: None
match_index: None
exitstatus: None
flag_eof: True
pid: 6063
child_fd: 6
closed: False
timeout: 30
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 2000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1

----------------------------------------------------------------------
Ran 10 tests in 50.870s

FAILED (errors=1)

The error reported:

ZeroDivisionError: float division

is the exact same error I received when first attempting to execute against my own csv file. A better traceback of the problem occurs when I run this command:

[user@localhost deduper]$ csvdedupe examples/csv_example_messy_input.csv --field_names "Site name" Address Zip Phone --output_file output.csv
INFO:root:imported 3337 rows
INFO:root:using fields: ['Site name', 'Address', 'Zip', 'Phone']
INFO:root:taking a sample of 1500 possible pairs
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.4/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib64/python3.4/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/usr/lib64/python3.4/site-packages/dedupe/datamodel.py", line 83, in distances
    record_2[field])
  File "affinegap/affinegap.pyx", line 115, in affinegap.affinegap.normalizedAffineGapDistance (affinegap/affinegap.c:1788)
  File "affinegap/affinegap.pyx", line 134, in affinegap.affinegap.normalizedAffineGapDistance (affinegap/affinegap.c:1615)
ZeroDivisionError: float division
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/bin/csvdedupe", line 11, in <module>
    sys.exit(launch_new_instance())
  File "/usr/lib/python3.4/site-packages/csvdedupe/csvdedupe.py", line 162, in launch_new_instance
    d.main()
  File "/usr/lib/python3.4/site-packages/csvdedupe/csvdedupe.py", line 86, in main  
    deduper.sample(data_d, self.sample_size)
  File "/usr/lib64/python3.4/site-packages/dedupe/api.py", line 931, in sample
    self._loadSample(data_sample)
  File "/usr/lib64/python3.4/site-packages/dedupe/api.py", line 873, in _loadSample
    self.num_cores)
  File "/usr/lib64/python3.4/site-packages/dedupe/training.py", line 52, in __init__
    2))
  File "/usr/lib64/python3.4/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib64/python3.4/multiprocessing/pool.py", line 599, in get
    raise self._value
ZeroDivisionError: float division

I have encountered these errors using Python 2.7 and Python 3.4 on a CentOS 7 VM, Cygwin, and Python4Windows. In each instance I managed to get numpy to install. Since the error appears to end in the multiprocessing libraries, perhaps I have not properly configured Python? However, I have tried many different combinations already. Any hints on how I can execute this test successfully?

csvdedupe examples/csv_example_messy_input.csv --field_names "Site name" Address Zip Phone --output_file output.csv

API for command line

dedupe input.csv output.csv --field-names="foo,bar,baz" --pr=1.0

dedupe input.csv output.csv --training-file=training.csv

--training-file is optional; if not present, then active-learn.
--active: if present and a training file is also present, then active-learn on top of the training file.
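A sketch of that interface in argparse terms (flag names are taken from this issue, not from the current code; the --pr flag above is left out since its meaning isn't spelled out):

import argparse

parser = argparse.ArgumentParser(prog='dedupe')
parser.add_argument('input')
parser.add_argument('output')
parser.add_argument('--field-names', help='comma-separated fields to compare')
parser.add_argument('--training-file',
                    help='optional; if absent, active-learn from scratch')
parser.add_argument('--active', action='store_true',
                    help='active-learn on top of an existing training file')
args = parser.parse_args()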

Records do not line up with data model

INFO:root:imported 269277 rows from file 1
INFO:root:imported 36467 rows from file 2
INFO:root:using fields: [u'id', u'geo_latitude', u'geo_longitude', u'star_rating_value', u'name', u'city', u'country', u'chain_name', u'type', u'address', u'fax', u'email', u'website']
INFO:root:taking a sample of 15000 possible pairs
Traceback (most recent call last):
  File "/usr/local/bin/csvlink", line 11, in <module>
    sys.exit(launch_new_instance())
  File "/usr/local/lib/python2.7/site-packages/csvdedupe/csvlink.py", line 169, in launch_new_instance
    d.main()
  File "/usr/local/lib/python2.7/site-packages/csvdedupe/csvlink.py", line 119, in main
    deduper.sample(data_1, data_2, self.sample_size)
  File "/Library/Python/2.7/site-packages/dedupe/api.py", line 876, in sample
    self._checkData(data_1, data_2)
  File "/Library/Python/2.7/site-packages/dedupe/api.py", line 910, in _checkData
    self.data_model.check(next(iter(viewvalues(data_2))))
  File "/Library/Python/2.7/site-packages/dedupe/datamodel.py", line 123, in check
    "in a record" % field)
ValueError: Records do not line up with data model. The field 'website' is in data_model but not in a record

UnicodeEncodeError when running 0.1.10

I'm using csvdedupe v0.1.10 on Ubuntu 12.04 (Python 2.7.11), installed via pip.

Consider test.csv, which is encoded in UTF-8

"ID";"Name_ID";"Name";"Interessengruppe";"Branche";"Stiftung"
1;1;"Wiler Parkhaus AG";"Individualverkehr";"Energie, Umwelt & Mobilität";0
2;2;"Stefan Kölliker Treuhand & Unternehmensberatung";"Advokaturen/Treuhand";"Beratung, Advokatur, PR & Treuhand";0
3;3;"HB-Gruppe";"Investmentgesellschaften";"Finanzwirtschaft & Versicherungen";0

and dedupe_config.json

{
  "field_names": ["Name", "Interessengruppe", "Branche", "Stiftung"],
  "field_definitions" : [{"field" : "Name", "type" : "String"},
                        {"field" : "Interessengruppe", "type" : "String"},
                        {"field" : "Branche", "type" : "String"},
                        {"field" : "Stiftung", "type": "Categorical",
                        "categories" : ["0", "1"]}
                        ],
  "output_file": "examples/organisations_deduped.csv",
  "skip_training": false,
  "training_file": "training.json",
  "sample_size": 150000,
  "recall_weight": 2
}

Now, when running csvdedupe --config_file dedupe_config.json test.csv, I get the following error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 70: ordinal not in range(128)

\xe4 corresponds to "ä" in "Mobilität" as far as I know.

I thought this should work with unicodecsv.
Thanks!

Too many values to unpack error in consoleLabel, line 183, csvhelpers.py

Getting an error here:

for field in set(field for field, compare in deduper.data_model.field_comparators):

Turns out the field_comparators list contains 4-tuple elements, not 2-tuples, so the fix should be something like:

for field in set(field for field, compare, unused_1, unused_2 in deduper.data_model.field_comparators):

on error, training.json is not saved

ValueError: No predicate found! We could not learn a single good predicate. Maybe give Dedupe more training data

Catch the ValueError and save training.json so the user can go back and add more training data; sketched below.
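A sketch of the change, assuming dedupe 1.x's writeTraining API (deduper and training_file are placeholders for the names used in csvdedupe's main):

# deduper: a dedupe.Dedupe instance mid-training; training_file: a path
try:
    deduper.train()
except ValueError:
    # persist the labeled examples before bailing out, so the user's
    # labeling effort can be extended on the next run instead of redone
    with open(training_file, 'w') as tf:
        deduper.writeTraining(tf)
    raise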

cannot import name 'SIGPIPE'

Hi, I have written the following code:

from csvdedupe import csvlink

and get the exception

ImportError was unhandled by user code
cannot import name 'SIGPIPE'

I am on Windows 7, using Visual Studio 2015 (Version 14) and Python 3.5.
I already tried this solution, but it didn't work, since signal does not appear to have the member SIGPIPE.

Is there a way to fix this?
Thank you.
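For what it's worth, the usual portability guard is to look the signal up defensively, so the module imports cleanly on Windows too (a sketch, not necessarily how csvdedupe will fix it):

import signal

# SIGPIPE does not exist on Windows, so only register the handler when present
if hasattr(signal, 'SIGPIPE'):
    signal.signal(signal.SIGPIPE, signal.SIG_DFL)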

Combine redundant field_names and field_definition settings

We don't need both in the config:

{
  "field_names": ["Site name", "Address", "Zip", "Phone"],
  "field_definition" : {"Site name" : {"type" : "String"},
                        "Address"   : {"type" : "String"},
                        "Zip"       : {"type" : "String",
                                       "Has Missing" : true},
                        "Phone"     : {"type" : "String",
                                       "Has Missing" : true}}
}

but we should still support field_names as a command-line parameter, defaulting to String type without 'Has Missing'. Deriving the names from the definitions is sketched below.
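Deriving one from the other is a one-liner, sketched here against the config shape shown above:

config = {
    "field_definition": {"Site name": {"type": "String"},
                         "Address":   {"type": "String"}}
}
# field_names no longer needs to be spelled out separately
field_names = list(config["field_definition"].keys())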
