
jellyfish's People

Contributors

ahood, antoinerondelet, dependabot-preview[bot], dependabot[bot], diego-plan9, dimitripapadopoulos, dmckean, dparrol, fernand0, heirecka, j535d165, jamesturk, jimmyshah, juliangilbey, karatheodory, layday, loisaidasam, martinomensio, maxbachmann, mikejs, nchammas, odidev, ofek, peterscott, timgates42, viccuad, wtcross

jellyfish's Issues

Please include docs/ and testdata/ in PyPI's tar.gz

Hi,

I'm packaging jellyfish for Debian (in fact it is already done; there's another person already doing it), and
it would be nice if the PyPI tar.gz files shipped with the docs included
(primarily because of the changelog).

Cheers and thanks

NYSIIS is broken

import jellyfish as jf
names = [ 'Catherine', 'Katherine', 'Katarina',
          'Johnathan', 'Jonathan', 'John',
          'Teresa', 'Theresa',
          'Smith', 'Smyth',
          'Jessica',
          'Joshua',
          ]

for n in names:
    print '%-10s' % n, jf.nysiis(n)
...
Catherine  CATARAN
Katherine  CATARAN
Katarina   CATARAN
Johnathan  JAONATAN
Jonathan   JANATAN
John       JAON
Teresa     TARAS
Theresa    TTARAS
Smith      SNAT
Smyth      SNYT
Jessica    JASAC
Joshua     JAS
>>>

It should return

Catherine  CATARAN
Katherine  CATARAN
Katarina   CATARAN
Johnathan  JANATAN
Jonathan   JANATAN
John       JAN
Teresa     TARAS
Theresa    TARAS
Smith      SNATH
Smyth      SNATH
Jessica    JASAC
Joshua     JAS

metaphone returns different results for same word, different case

Found an instance where metaphone will return different results based on case. My quick research indicates that metaphone should be case-insensitive.

>>> import jellyfish
>>> jellyfish.metaphone('kentucky')
'KNTK'
>>> jellyfish.metaphone('KENTUCKY')
'KNTKK'

JELLYFISH IS UNUSABLE!!!!!!

ld returned 1 exit status

Hello,

I am attempting to install a module that is dependent upon jellyfish, but I can't seem to get jellyfish to install. I have tried to install using pip and from source. I get the same error every time:

collect2.exe: error: ld returned 1 exit status
error: command 'gcc' failed with exit status 1

I have been unable to find a thread discussing this issue with jellyfish, and I am afraid I don't yet know enough to modify remedies used for different modules. Any thoughts?

Please see the install output below:

C:\Users\choct155\Python\Modules\jellyfish\jellyfish-0.2.0>gcc --version
gcc (tdm-1) 4.7.1
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

C:\Users\choct155\Python\Modules\jellyfish\jellyfish-0.2.0>python setup.py insta
ll
running install
running bdist_egg
running egg_info
writing jellyfish.egg-info\PKG-INFO
writing top-level names to jellyfish.egg-info\top_level.txt
writing dependency_links to jellyfish.egg-info\dependency_links.txt
reading manifest file 'jellyfish.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'jellyfish.egg-info\SOURCES.txt'
installing library code to build\bdist.win32\egg
running install_lib
running build_ext
building 'jellyfish' extension
C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c
jellyfishmodule.c -o build\temp.win32-2.7\Release\jellyfishmodule.o
jellyfishmodule.c:319:5: warning: initialization from incompatible pointer type
[enabled by default]
jellyfishmodule.c:319:5: warning: (near initialization for 'jellyfish_methods[0]
.ml_meth') [enabled by default]
jellyfishmodule.c:323:5: warning: initialization from incompatible pointer type
[enabled by default]
jellyfishmodule.c:323:5: warning: (near initialization for 'jellyfish_methods[1]
.ml_meth') [enabled by default]
jellyfishmodule.c:327:5: warning: initialization from incompatible pointer type
[enabled by default]
jellyfishmodule.c:327:5: warning: (near initialization for 'jellyfish_methods[2]
.ml_meth') [enabled by default]
C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c
jaro.c -o build\temp.win32-2.7\Release\jaro.o
jaro.c: In function '_jaro_winkler':
jaro.c:52:5: warning: implicit declaration of function 'alloca' [-Wimplicit-func
tion-declaration]
jaro.c:52:17: warning: incompatible implicit declaration of built-in function 'a
lloca' [enabled by default]
C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c
hamming.c -o build\temp.win32-2.7\Release\hamming.o
C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c
levenshtein.c -o build\temp.win32-2.7\Release\levenshtein.o
C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c
damerau_levenshtein.c -o build\temp.win32-2.7\Release\damerau_levenshtein.o
C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c
mra.c -o build\temp.win32-2.7\Release\mra.o
C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c
soundex.c -o build\temp.win32-2.7\Release\soundex.o
C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c
metaphone.c -o build\temp.win32-2.7\Release\metaphone.o
C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c
nysiis.c -o build\temp.win32-2.7\Release\nysiis.o
nysiis.c: In function 'nysiis':
nysiis.c:13:5: warning: implicit declaration of function 'alloca' [-Wimplicit-fu
nction-declaration]
nysiis.c:13:18: warning: incompatible implicit declaration of built-in function
'alloca' [enabled by default]
C:\MinGW32\bin\gcc.exe -mdll -O -Wall -IC:\Python27\include -IC:\Python27\PC -c
porter.c -o build\temp.win32-2.7\Release\porter.o
porter.c: In function 'step5':
porter.c:362:7: warning: suggest parentheses around '&&' within '||' [-Wparenthe
ses]
writing build\temp.win32-2.7\Release\jellyfish.def
C:\MinGW32\bin\gcc.exe -shared -s build\temp.win32-2.7\Release\jellyfishmodule.o
build\temp.win32-2.7\Release\jaro.o build\temp.win32-2.7\Release\hamming.o buil
d\temp.win32-2.7\Release\levenshtein.o build\temp.win32-2.7\Release\damerau_leve
nshtein.o build\temp.win32-2.7\Release\mra.o build\temp.win32-2.7\Release\sounde
x.o build\temp.win32-2.7\Release\metaphone.o build\temp.win32-2.7\Release\nysiis
.o build\temp.win32-2.7\Release\porter.o build\temp.win32-2.7\Release\jellyfish.
def -LC:\Python27\libs -LC:\Python27\PCbuild -lpython27 -o build\lib.win32-2.7\j
ellyfish.pyd
build\temp.win32-2.7\Release\jellyfishmodule.o:jellyfishmodule.c:(.text+0x188):
undefined reference to `_imp___Py_TrueStruct'
build\temp.win32-2.7\Release\jellyfishmodule.o:jellyfishmodule.c:(.text+0x191):
undefined reference to `_imp___Py_ZeroStruct'
build\temp.win32-2.7\Release\jellyfishmodule.o:jellyfishmodule.c:(.text+0x2ab):
undefined reference to `_imp__PyExc_TypeError'
build\temp.win32-2.7\Release\jellyfishmodule.o:jellyfishmodule.c:(.text+0x5f8):
undefined reference to `_imp__PyExc_TypeError'
collect2.exe: error: ld returned 1 exit status
error: command 'gcc' failed with exit status 1

C:\Users\choct155\Python\Modules\jellyfish\jellyfish-0.2.0>

damerau_levenshtein_distance returns error

The damerau_levenshtein_distance method returns an error. A google search did not return any solutions.
Thanks.

jellyfish.damerau_levenshtein_distance('Pop Country', 'Country Pop')
Traceback (most recent call last):
File "", line 1, in
TypeError: must be cannot convert raw buffers, not str

sudo pip install jellyfish fails

Downloading/unpacking jellyfish
Downloading jellyfish-0.2.0.tar.gz
Running setup.py egg_info for package jellyfish

Installing collected packages: jellyfish
Running setup.py install for jellyfish
building 'jellyfish' extension
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.7 -c jellyfishmodule.c -o build/temp.linux-i686-2.7/jellyfishmodule.o
jellyfishmodule.c:1:20: fatal error: Python.h: No such file or directory
compilation terminated.
error: command 'gcc' failed with exit status 1
Complete output from command /usr/bin/python -c "import setuptools;file='/var/www/build/jellyfish/setup.py';exec(compile(open(file).read().replace('\r\n', '\n'), file, 'exec'))" install --single-version-externally-managed --record /tmp/pip-Qwrcft-record/install-record.txt:
running install

running build

running build_ext

building 'jellyfish' extension

creating build

creating build/temp.linux-i686-2.7

gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.7 -c jellyfishmodule.c -o build/temp.linux-i686-2.7/jellyfishmodule.o

jellyfishmodule.c:1:20: fatal error: Python.h: No such file or directory

compilation terminated.

error: command 'gcc' failed with exit status 1

Metaphone Error on WH

The metaphone function fails when run on the string "WH" because it advances past the end of the string.
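
One common way to avoid this kind of overrun is to route every lookahead through a helper that returns an empty string past the end of the input. The sketch below is illustrative only: `char_at` and `wh_lookahead` are hypothetical names, not functions from jellyfish, and the real metaphone code may be structured differently.

```python
def char_at(s, index):
    """Return s[index], or '' if index falls outside the string."""
    if 0 <= index < len(s):
        return s[index]
    return ''


def wh_lookahead(word):
    """Return the character following a leading 'WH', or '' if there is none.

    Hypothetical helper: it only illustrates a bounds-checked lookahead.
    """
    word = word.upper()
    if char_at(word, 0) == 'W' and char_at(word, 1) == 'H':
        # Safe even when the word is exactly 'WH': index 2 yields ''.
        return char_at(word, 2)
    return ''
```

With this pattern, `wh_lookahead('WH')` returns an empty string instead of raising IndexError.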

Python3 support?

Is there Python 3 support for this module? If so, could we get that added to the PyPI record? It's useful, as various compatibility tools and pages check this information.

Unicode support

The metaphone and soundex methods support unicode strings, but no other method seems to.

e.g.

>>> match = jellyfish.jaro_winkler(u'éabc', 'údef')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

repeatable crash in new damerau implementation

pip install -e .; python -c "import jellyfish; print jellyfish.damerau_levenshtein_distance(u'test', u'test')"

crashes w/ a MemoryError

initial debugging points to the malloc((len1+2) * cols * sizeof(size_t)) call

Cannot find *.c files

When trying to install, I get the messages:

warning: no files found matching '*.c'
warning: no files found matching '*.h'

I verified that the directory "cjellyfish" exists and has the .c and .h files

The install works, but only because it uses the python code.

metaphone differences between PyPy and CPython

jellyfish returns different metaphone encodings when used on PyPy and CPython. Issue was discovered in python-us (unitedstates/python-us#13) and reproduced below:

Python 3.5.1:

>>> jellyfish.metaphone('Utah')
'UT'

Python 2.7.11:

>>> jellyfish.metaphone(u'Utah')
'UT'

PyPy 4.0.1:

>>>> jellyfish.metaphone(u'Utah')
u'UTH'

exception IndexError

try input:
name1='Pedro Pablo Valdeben Petersen'
name2='Pedro Pablo Valdebenito Petersen'

IndexError: string index out of range

Hi,

I get an error while finding the metaphone of certain words.
"Aapti", "Aarti"

The error is
File "build\bdist.win32\egg\jellyfish\_jellyfish.py", line 425, in metaphone
IndexError: string index out of range

String Functions should operate on wchar_t instead of char

First, I'd like to say that this library is awesome. It has all the greats in one place and doesn't try to do anything too fancy. The implementations are clean as well.

Currently, the library doesn't really have support for unicode strings. If I try to submit a unicode string with a non-ascii character, I get a traceback:

>>> jellyfish.hamming_distance(u'\u725b'.encode('utf8'), u'\u4faf')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u4faf' in position 0: ordinal not in range(128)

In addition, when one does encode them so that they will appropriately convert to an encoded string (non UCS-4 unicode), the incorrect answer is given. This example should give 1, rather than 3:

>>> jellyfish.hamming_distance(u'\u725b'.encode('utf8'), u'\u4faf'.encode('utf8'))
3

This obviously doesn't make sense for things like soundex or other english only algorithms, but as a general rule, python libraries should take only unicode objects for string operations.
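
The difference between counting code points and counting UTF-8 bytes can be reproduced without jellyfish. The following is a minimal Python 3 sketch; `count_mismatches` is a hypothetical stand-in, not the library's hamming_distance.

```python
def count_mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError('sequences must have the same length')
    return sum(x != y for x, y in zip(a, b))


s1 = '\u725b'
s2 = '\u4faf'

# Compared as text, the two strings differ in exactly one code point.
by_code_point = count_mismatches(s1, s2)

# Compared as UTF-8, each character becomes three bytes, all of which differ.
by_utf8_byte = count_mismatches(s1.encode('utf8'), s2.encode('utf8'))

print(by_code_point, by_utf8_byte)  # prints: 1 3
```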

Patching this library to support unicode objects, rather than string objects shouldn't be too bad. You just need to replace the PyString_FromString with PyUnicode_FromWideChar and update the functions to use wchar_t. Here is the python c api unicode reference:

http://docs.python.org/c-api/unicode.html

If you want, I can fork this and send you a pull request. Just let me know.

jaro_winkler() docstring ignore_case=True

The Python docstring for the jaro_winkler() function says that it takes three arguments and the third is optional and defaults to ignoring case. Not only does it not take a third argument, the function does not ignore case. Is this a new feature that is being worked on?

TypeError: expected unicode, got str

I am getting this error when running any of the functions in the readme:

>>> import jellyfish
>>> jellyfish.metaphone('Jellyfish')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: expected unicode, got str

I am not sure if this is an error in jellyfish or a more general python error
I have tried with python 2.6 and 2.7

Is there something I can do to fix it?

Segmentation fault with jellyfish.match_rating_comparison

Hello,

Thanks a lot for your package, it's been really useful to me so far. However, I'm getting a segfault when using jellyfish.match_rating_comparison multiple times:

import jellyfish
import hashlib
import numpy as np
sha1s = np.array([hashlib.sha1(str(v)).hexdigest() for v in range(100)])
r = [[jellyfish.match_rating_comparison(h1, h2) for h1 in sha1s] for h2 in sha1s]
r = [[jellyfish.match_rating_comparison(h1, h2) for h1 in sha1s] for h2 in sha1s]
[1]    70352 segmentation fault  ipython

Are you aware of this issue?

Thanks.

n

change to MIT license

jellyfish is excellent, I'm using it in Julia with some success.

I was thinking of porting it to Julia properly, eg. creating a registered package, however Julia is trying hard not be constrained by the BSD license.

Is there any possibility of changing the license to MIT?

jaro_winkler() takes no keyword arguments

jellyfish.jaro_winkler(unicode_str1, unicode_str2, long_tolerance=True) returns Error:
TypeError: jaro_winkler() takes no keyword arguments.

Am I doing something wrong? long_tolerance seems like it should be a valid keyword argument in this implementation:
def jaro_winkler(s1, s2, long_tolerance=False):...

Python 2 examples on README.rst

A small issue noticed during the Debian packaging process: it seems that the README "Example Usage" section contains some examples that, if executed directly, result in a minor error when using the Python 2 version:

>>> import jellyfish
>>> jellyfish.levenshtein_distance('jellyfish', 'smellyfish')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "jellyfish/_jellyfish.py", line 13, in levenshtein_distance
raise TypeError(_no_bytes_err)
TypeError: expected unicode, got str

>>> jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
2

It seems that this is by design. If that is the case, should the README.rst file be updated to reflect the right usage of the arguments (always forcing the user to pass unicode instead of str)?

Can't install

Dear Creators,

I am a novice developer trying to work with jellyfish on a Mac, but I can't quite get it to work.

Downloading/unpacking jellyfish
Downloading jellyfish-0.2.0.tar.gz
Running setup.py egg_info for package jellyfish

Installing collected packages: jellyfish
Running setup.py install for jellyfish
building 'jellyfish' extension
gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 -g -O2 -DNDEBUG -g -O3 -I/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c jellyfishmodule.c -o build/temp.macosx-10.6-intel-2.7/jellyfishmodule.o
unable to execute gcc-4.2: No such file or directory
error: command 'gcc-4.2' failed with exit status 1
Complete output from command /Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python -c "import setuptools;file='/tmp/pip-build/jellyfish/setup.py';exec(compile(open(file).read().replace('\r\n', '\n'), file, 'exec'))" install --record /tmp/pip-k8bXvN-record/install-record.txt --single-version-externally-managed:
running install

running build

running build_ext

building 'jellyfish' extension

creating build

creating build/temp.macosx-10.6-intel-2.7

gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 -g -O2 -DNDEBUG -g -O3 -I/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c jellyfishmodule.c -o build/temp.macosx-10.6-intel-2.7/jellyfishmodule.o

unable to execute gcc-4.2: No such file or directory

error: command 'gcc-4.2' failed with exit status 1


Command /Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python -c "import setuptools;file='/tmp/pip-build/jellyfish/setup.py';exec(compile(open(file).read().replace('\r\n', '\n'), file, 'exec'))" install --record /tmp/pip-k8bXvN-record/install-record.txt --single-version-externally-managed failed with error code 1 in /tmp/pip-build/jellyfish
Storing complete log in /Users/Vishaal/.pip/pip.log

Any insight would be awesome!
Thanks!
Vishaal

string encoding parameter

When textual data is handled as Unicode, it must first be encoded as UTF-8 to avoid a UnicodeEncodeError during string comparisons. For example, with the jaro_winkler string comparison:

import jellyfish
name = u'Francisco Alarc\xben'
similarity = jellyfish.jaro_winkler(name, name)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-48-c671ac4ac12e> in <module>()
----> 1 jellyfish.jaro_winkler(p.full_name, p.full_name)

UnicodeEncodeError: 'ascii' codec can't encode character u'\xbe' in position 15:     ordinal not in range(128)

This can be avoided by first encoding in utf-8:

name = name.encode('utf-8')
similarity = jellyfish.jaro_winkler(name, name)

... but it would be nice to have an option for this instead:

similarity = jellyfish.jaro_winkler(name, name, 'utf-8')
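
A wrapper along these lines could offer such an option without changing the comparison functions themselves. This is only a sketch under stated assumptions: `to_text`, `with_encoding` and `exact_match` are hypothetical names, `exact_match` merely stands in for a real comparison such as jaro_winkler, and the wrapper decodes bytes to text (rather than encoding text to bytes) so that the comparison sees code points.

```python
def to_text(value, encoding='utf-8'):
    """Decode bytes with the given encoding; pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def with_encoding(compare):
    """Wrap a two-string comparison so it accepts an optional encoding."""
    def wrapper(s1, s2, encoding='utf-8'):
        return compare(to_text(s1, encoding), to_text(s2, encoding))
    return wrapper


def exact_match(s1, s2):
    """Stand-in comparison used only for this illustration."""
    return 1.0 if s1 == s2 else 0.0


compare = with_encoding(exact_match)
name = 'Francisco Alarc\xf3n'
print(compare(name.encode('utf-8'), name))                # text vs. UTF-8 bytes
print(compare(name.encode('latin-1'), name, 'latin-1'))   # explicit encoding
```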

Damerau-Levenshtein distance doesn't work for higher unicode symbols

Damerau-Levenshtein distance cannot be calculated via the cjellyfish extension if either of the two words contains certain Unicode symbols (e.g. Russian letters, or non-ASCII characters such as ŭ).

Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import jellyfish.cjellyfish as cjellyfish
>>> import jellyfish._jellyfish as pyjellyfish
>>> pyjellyfish.damerau_levenshtein_distance(u'хлеб', u'пиво')
4
>>> cjellyfish.damerau_levenshtein_distance(u'хлеб', u'пиво')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Encountered unsupported code point in string.
>>> cjellyfish.damerau_levenshtein_distance(u'tets', u'test')
1
>>> 
$ pip show jellyfish

---
Metadata-Version: 2.0
Name: jellyfish
Version: 0.5.6

nysiis() struggles with non-ASCII characters that soundex() and metaphone() can handle

Here's a simple repro with the non-ASCII character "ç":

>>> jellyfish.nysiis('ç')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: character U+ffffffff is not in range [U+0000; U+10ffff]
>>> jellyfish.soundex('ç')
'C000'
>>> jellyfish.metaphone('ç')
'K'

Is this expected?

I'm on Jellyfish 0.5.6 / Python 3.5.2.

Confusing error messages when passing incorrect arguments

Hello,

Passing incorrect arguments can be confusing in jellyfish. See the following:

Python 2.7.11 |Anaconda 4.0.0 (x86_64)| (default, Dec  6 2015, 18:57:58) 

In [1]: from jellyfish._jellyfish import soundex, nysiis

In [2]: nysiis('jamesturk')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-f6b8e2a0ef35> in <module>()
----> 1 nysiis('jamesturk')

/Users/jonathandebruin/anaconda/lib/python2.7/site-packages/jellyfish/_jellyfish.pyc in nysiis(s)
    216 def nysiis(s):
    217     if isinstance(s, bytes):
--> 218         raise TypeError(_no_bytes_err)
    219     if not s:
    220         return ''

TypeError: expected unicode, got str

In [3]: nysiis(u'jamesturk')
Out[3]: u'JANASTARC'

In [4]: nysiis([u'jamesturk', u'Jonathan'])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-d24d43c545bb> in <module>()
----> 1 nysiis([u'jamesturk', u'Jonathan'])

/Users/jonathandebruin/anaconda/lib/python2.7/site-packages/jellyfish/_jellyfish.pyc in nysiis(s)
    220         return ''
    221 
--> 222     s = s.upper()
    223     key = []
    224 

AttributeError: 'list' object has no attribute 'upper'

In [5]: nysiis(10)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-0a8a2a0920ef> in <module>()
----> 1 nysiis(10)

/Users/jonathandebruin/anaconda/lib/python2.7/site-packages/jellyfish/_jellyfish.pyc in nysiis(s)
    220         return ''
    221 
--> 222     s = s.upper()
    223     key = []
    224 

AttributeError: 'int' object has no attribute 'upper'

Instead of raising in case of bytes, it is maybe better to raise if not unicode for python2 or str for python3. Like this:

    if IS_PY3 and not isinstance(s, str):
        raise TypeError('expected str or unicode, got %s' % type(s).__name__)
    elif not IS_PY3 and not isinstance(s, unicode):
        raise TypeError('expected unicode, got %s' % type(s).__name__)

This is the new output:

In [2]: from jellyfish._jellyfish import soundex, nysiis

In [3]: nysiis('jamesturk')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-f6b8e2a0ef35> in <module>()
----> 1 nysiis('jamesturk')

/Users/jonathandebruin/GitHub/recordlinkage/jellyfish/jellyfish/_jellyfish.py in nysiis(s)
    221 
    222 def nysiis(s):
--> 223     _check_type(s)
    224 
    225     if not s:

/Users/jonathandebruin/GitHub/recordlinkage/jellyfish/jellyfish/_jellyfish.py in _check_type(s)
     13         raise TypeError('expected str or unicode, got %s' % type(s).__name__)
     14     elif not IS_PY3 and not isinstance(s, unicode):
---> 15         raise TypeError('expected unicode, got %s' % type(s).__name__)
     16 
     17 def levenshtein_distance(s1, s2):

TypeError: expected unicode, got str

In [4]: nysiis(u'jamesturk')
Out[4]: u'JANASTARC'

In [5]: nysiis([u'jamesturk', u'Jonathan'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-d24d43c545bb> in <module>()
----> 1 nysiis([u'jamesturk', u'Jonathan'])

/Users/jonathandebruin/GitHub/recordlinkage/jellyfish/jellyfish/_jellyfish.py in nysiis(s)
    221 
    222 def nysiis(s):
--> 223     _check_type(s)
    224 
    225     if not s:

/Users/jonathandebruin/GitHub/recordlinkage/jellyfish/jellyfish/_jellyfish.py in _check_type(s)
     13         raise TypeError('expected str or unicode, got %s' % type(s).__name__)
     14     elif not IS_PY3 and not isinstance(s, unicode):
---> 15         raise TypeError('expected unicode, got %s' % type(s).__name__)
     16 
     17 def levenshtein_distance(s1, s2):

TypeError: expected unicode, got list

In [6]: nysiis(10)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-0a8a2a0920ef> in <module>()
----> 1 nysiis(10)

/Users/jonathandebruin/GitHub/recordlinkage/jellyfish/jellyfish/_jellyfish.py in nysiis(s)
    221 
    222 def nysiis(s):
--> 223     _check_type(s)
    224 
    225     if not s:

/Users/jonathandebruin/GitHub/recordlinkage/jellyfish/jellyfish/_jellyfish.py in _check_type(s)
     13         raise TypeError('expected str or unicode, got %s' % type(s).__name__)
     14     elif not IS_PY3 and not isinstance(s, unicode):
---> 15         raise TypeError('expected unicode, got %s' % type(s).__name__)
     16 
     17 def levenshtein_distance(s1, s2):

TypeError: expected unicode, got int

Bye, Jonathan

stdbool.h issue during setup (Windows)

Hey guys, I don't expect Windows to be a priority, but C99 isn't supported by Visual Studio any more, and the install of jellyfish hits this error:

jellyfish-master\cjellyfish\jellyfish.h(4) : fatal error C1083: Cannot open include file: 'stdbool.h': No such file or directory

There's a lot of info out there on how to work around this issue, but maybe you've already solved it?

Thanks,

-James

hamming distance isn't?

Need to figure out whether hamming distance is right: should it refuse to work on strings of different lengths?
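
The two possible behaviours can be compared side by side. This is a sketch for discussion only: `hamming_strict` and `hamming_lenient` are hypothetical names, and neither is claimed to match what jellyfish currently does.

```python
def hamming_strict(s1, s2):
    """Classic definition: only defined for strings of equal length."""
    if len(s1) != len(s2):
        raise ValueError('strings must have the same length')
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


def hamming_lenient(s1, s2):
    """Extension: every extra character in the longer string counts as 1."""
    mismatches = sum(c1 != c2 for c1, c2 in zip(s1, s2))
    return mismatches + abs(len(s1) - len(s2))
```

The lenient variant agrees with the strict one whenever the lengths match, so the choice only affects inputs the classic definition rejects.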

damerau_levenshtein.c segfault

First off, thanks for the library! It's been great.

Now the issue--
I've been getting a segfault (reproduced on both Ubuntu and OS X) in damerau_levenshtein_distance, called from the python library. Strings being compared were "mylifeoutdoors" and "нахлыст".

Poking around in gdb, it looks like the unicode characters cause a bad lookup in the "da" array. If the Cyrillic characters are out of scope for this library, would you be opposed to a change to detect out of bounds code points?

Jellyfish version: 0.5.3

Stack trace:
#0 0x00007feea2c5a340 in damerau_levenshtein_distance (s1=0x7fee9aeee630, s2=0x7fee9aea3eb0, len1=14, len2=7)

at cjellyfish/damerau_levenshtein.c:58
    infinite = 21
    cols = 9
    i = 1
    j = 7
    i1 = 140662777395312
    j1 = 0
    db = 0
    d1 = 7
    d2 = 7
    d3 = 8
    d4 = <error reading variable d4 (Cannot access memory at address 0x23fb1b90e40100)>
    result = <optimized out>
    dist = 0x18c2180
    da = 0x1c7c510

#1 0x00007feea2c597b6 in jellyfish_damerau_levenshtein_distance (self=, args=)

at cjellyfish/jellyfishmodule.c:156
    s1 = 0x7fee9aeee630
    s2 = 0x7fee9aea3eb0
    len1 = 14
    len2 = 7
    result = <optimized out>
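
One way to sidestep out-of-bounds lookups is to key the last-seen table by character instead of indexing a fixed-size array by code point. The pure-Python sketch below of the unrestricted Damerau-Levenshtein distance takes that approach. It is not the cjellyfish implementation, and its variable names only loosely mirror those in the stack trace.

```python
def damerau_levenshtein_distance(s1, s2):
    """Unrestricted Damerau-Levenshtein distance for arbitrary characters."""
    len1, len2 = len(s1), len(s2)
    infinite = len1 + len2

    # Last row in which each character of s1 was seen. A dict has no
    # upper bound on code points, unlike a fixed-size array.
    da = {}

    # Distance matrix with one extra sentinel row and column.
    score = [[0] * (len2 + 2) for _ in range(len1 + 2)]
    score[0][0] = infinite
    for i in range(len1 + 1):
        score[i + 1][0] = infinite
        score[i + 1][1] = i
    for j in range(len2 + 1):
        score[0][j + 1] = infinite
        score[1][j + 1] = j

    for i in range(1, len1 + 1):
        db = 0  # last column in this row where the characters matched
        for j in range(1, len2 + 1):
            i1 = da.get(s2[j - 1], 0)
            j1 = db
            cost = 1
            if s1[i - 1] == s2[j - 1]:
                cost = 0
                db = j
            score[i + 1][j + 1] = min(
                score[i][j] + cost,        # substitution or match
                score[i + 1][j] + 1,       # insertion
                score[i][j + 1] + 1,       # deletion
                score[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1),  # transposition
            )
        da[s1[i - 1]] = i

    return score[len1 + 1][len2 + 1]
```

Because the table is a dictionary, Cyrillic or other non-ASCII input needs no special handling or bounds check.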

damerau_levenshtein_distance fails occasionally

I'm getting an index out of range error on a specific combination of two strings that is hard to reproduce with other string pairs. I've tried quickly to debug it, but I just can't wrap my head around the algorithm quickly enough. This works in v0.2.2.

damerau_levenshtein_distance('cape sand recycling ', 'edith ann graham') --> list index out of range error, line 34, in _levenshtein_distance

Note the space at the end of string #1. If the strings are reversed, no error is thrown.

Revise EOL encoding of "testdata/porter.csv"

Would it be possible to revise the EOL encoding of testdata/porter.csv as it is the only file with CRLF line terminators on the repository?

jellyfish-testdata$ file *
damerau_levenshtein.csv:     UTF-8 Unicode text
hamming.csv:                 ASCII text
jaro_distance.csv:           UTF-8 Unicode text
jaro_winkler.csv:            ASCII text
levenshtein.csv:             ASCII text
match_rating_codex.csv:      ASCII text
match_rating_comparison.csv: ASCII text
metaphone.csv:               UTF-8 Unicode text
nysiis.csv:                  ASCII text
porter.csv:                  ASCII text, with CRLF line terminators
README.md:                   ASCII text
soundex.csv:                 UTF-8 Unicode text

While it should be properly handled by a well-configured local git, it is also carried over into the tarballs and causes some slight annoyances when diffing the tarball against the repository. I'm including a pull request against jellyfish-testdata that hopefully solves the problem in a quick and painless manner. Thanks in advance!

jellyfish from PyPI won't install

Good job with the release of the 0.3 version, but I have some issues installing it on Ubuntu 14.04.

$ python --version
Python 2.7.6
$ pip install jellyfish
Downloading/unpacking jellyfish
  Downloading jellyfish-0.3.0.tar.gz
  Running setup.py (path:~/.virtualenvs/pu/build/jellyfish/setup.py) egg_info for package jellyfish

    warning: no files found matching '*.c'
    warning: no files found matching '*.h'
Installing collected packages: jellyfish
  Running setup.py install for jellyfish
    building 'jellyfish.cjellyfish' extension
    x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.7 -c cjellyfish/jellyfishmodule.c -o build/temp.linux-x86_64-2.7/cjellyfish/jellyfishmodule.o
    cjellyfish/jellyfishmodule.c:3:23: fatal error: jellyfish.h: No such file or directory
     #include "jellyfish.h"
                           ^
    compilation terminated.
    ***************************************************************************
    WARNING: C extension could not be compiled, falling back to pure Python.
    ***************************************************************************

    warning: no files found matching '*.c'
    warning: no files found matching '*.h'
    ***************************************************************************
    WARNING: C extension could not be compiled, falling back to pure Python.
    ***************************************************************************
Successfully installed jellyfish
Cleaning up...
$ python -c "import jellyfish"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named jellyfish

The dev version works just fine though.

$ pip install -e git://github.com/sunlightlabs/jellyfish.git#egg=jellyfish-dev
    Creating ~/.virtualenvs/pu/lib/python2.7/site-packages/jellyfish.egg-link (link to .)
    Adding jellyfish 0.3.0 to easy-install.pth file

    Installed ~/.virtualenvs/pu/src/jellyfish
Successfully installed jellyfish
$ python -c "import jellyfish"
$ # profit!

No .h files were found in the .tar.gz from PyPI

$ wget https://pypi.python.org/packages/source/j/jellyfish/jellyfish-0.3.0.tar.gz
$ tar -xvzf jellyfish-0.3.0.tar.gz 
jellyfish-0.3.0/
jellyfish-0.3.0/cjellyfish/
jellyfish-0.3.0/cjellyfish/jellyfishmodule.c
jellyfish-0.3.0/cjellyfish/levenshtein.c
jellyfish-0.3.0/cjellyfish/hamming.c
jellyfish-0.3.0/cjellyfish/soundex.c
jellyfish-0.3.0/cjellyfish/jaro.c
jellyfish-0.3.0/cjellyfish/metaphone.c
jellyfish-0.3.0/cjellyfish/porter.c
jellyfish-0.3.0/cjellyfish/damerau_levenshtein.c
jellyfish-0.3.0/cjellyfish/nysiis.c
jellyfish-0.3.0/cjellyfish/mra.c
jellyfish-0.3.0/jellyfish/
jellyfish-0.3.0/jellyfish/test.py
jellyfish-0.3.0/jellyfish/porter.py
jellyfish-0.3.0/jellyfish/_jellyfish.py
jellyfish-0.3.0/jellyfish/__init__.py
jellyfish-0.3.0/jellyfish/compat.py
jellyfish-0.3.0/README.rst
jellyfish-0.3.0/setup.py
jellyfish-0.3.0/setup.cfg
jellyfish-0.3.0/PKG-INFO
jellyfish-0.3.0/jellyfish.egg-info/
jellyfish-0.3.0/jellyfish.egg-info/dependency_links.txt
jellyfish-0.3.0/jellyfish.egg-info/top_level.txt
jellyfish-0.3.0/jellyfish.egg-info/PKG-INFO
jellyfish-0.3.0/jellyfish.egg-info/SOURCES.txt
jellyfish-0.3.0/MANIFEST.in
jellyfish-0.3.0/LICENSE

Similarity functions for damerau_levenshtein, levenshtein and hamming

Hello Jamesturk,

What are your thoughts about adding similarity functions for damerau_levenshtein, levenshtein and hamming? Currently, only distance functions are available for these algorithms. I think, most of the applications use similarity functions.

What about:
damerau_levenshtein_similarity
levenshtein_similarity
hamming_similarity

Comparable with the R-package https://github.com/markvanderloo/stringdist.
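
As a rough illustration of what such a function might look like, the sketch below derives a similarity from the Levenshtein distance. The normalisation `1 - distance / max(len)` is an assumption made for this example, and both functions are standalone stand-ins rather than jellyfish code.

```python
def levenshtein_distance(s1, s2):
    """Plain dynamic-programming Levenshtein distance, one row at a time."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]
        for j, c2 in enumerate(s2, 1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (c1 != c2),  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(s1, s2):
    """Map the distance onto [0, 1], where 1.0 means identical strings."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(s1, s2) / float(longest)
```

The same wrapper shape would apply to the Damerau-Levenshtein and Hamming distances, since both are also bounded by the length of the longer string.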

I can make the Python ones if you like. C version does not look that hard either.

Kind regards, Jonathan

Installing jellyfish on Windows XP - Unable to find vcvarsall.bat

Hi!

Novice question ... I am trying to install jellyfish 0.2.0 on a Windows XP system running Python 2.7. I am using easy_install for the installation and am getting the following error ... Unable to find vcvarsall.bat

C:\Documents and Settings\Doug Caldwell>easy_install jellyfish
Searching for jellyfish
Reading http://pypi.python.org/simple/jellyfish/
Best match: jellyfish 0.2.0
Downloading https://pypi.python.org/packages/source/j/jellyfish/jell
tar.gz#md5=8f5d27bddd8986408f7004814982b202
Processing jellyfish-0.2.0.tar.gz
Running jellyfish-0.2.0\setup.py -q bdist_egg --dist-dir c:\docume1
cals
1\temp\easy_install-b_zfm2\jellyfish-0.2.0\egg-dist-tmp-4wy4nj
error: Setup script exited with error: Unable to find vcvarsall.bat

Any suggestions would be greatly appreciated!!

Thanks for the help ...

Doug Caldwell

Segfaults with nysiis

This is a difficult problem to replicate for me.
On my Linux dev laptop, I don't get this crash, but on one server (py2.6) and a colleague's OS X laptop (py2.7) the following causes a segfault:
jellyfish.nysiis('martincevic')

This is using jellyfish==0.2.1

It is also very easy to crash nysiis if you send any unicode to it (whereas metaphone copes by ignoring anything it doesn't understand)

I have very little time to help debug the issue right now, but at least I have a reliable test-case.
