yougov / fuzzy
License: MIT License
I am not sure this is the best venue for this question, but are the algorithms currently written only for English? Or, as Wikipedia puts it, for American Soundex? I am wondering how effective the algorithms would be when applied to other languages and, if there is no such support, whether there is a plan to add it.
Originally reported by: Alex Mikhalev (Bitbucket: alex_mikhalev, GitHub: Unknown)
Hello,
I found that fuzzy can't handle non-ASCII characters in unicode strings.
If I call DMetaphone with this product name:
Product name Blossom Hill White Zinfandel Rosé California (750ml)
Product name type <type 'unicode'>
I get this error:
/lib/python2.7/site-packages/fuzzy.so in fuzzy.DMetaphone.call()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 28: ordinal not in range(128)
Strangely, if the product name is a str with the same value, every algorithm handles it properly.
I understand that Soundex and NYSIIS can't work on non-ASCII characters, but they should be able to handle unicode passed as a string.
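One possible workaround for the report above is to reduce the text to ASCII before handing it to fuzzy. The following is a minimal sketch using only the standard library; `to_ascii` is a hypothetical helper name, not part of fuzzy, and dropping accents this way is an assumption that may not suit every language.

```python
# -*- coding: utf-8 -*-
import unicodedata


def to_ascii(text):
    """Hypothetical helper: strip accents, then drop anything still non-ASCII."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with "ignore" discards the combining marks and other non-ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")


name = u"Blossom Hill White Zinfandel Ros\xe9 California (750ml)"
ascii_name = to_ascii(name)
# ascii_name now holds only ASCII characters and could be passed to a
# phonetic algorithm in place of the original product name.
```

Characters with no ASCII decomposition are silently removed, so this trades accuracy for robustness.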
Originally reported by: Anonymous
I found that calling soundex() changes the input string to uppercase. Even creating a deep copy does not prevent the change.
My current workaround is to build a new string by appending the letters of the input string one at a time, then pass the new string to soundex().
As originally reported here, non-ASCII inputs appear to raise a UnicodeEncodeError.
Originally reported by: Anonymous
Computing the Soundex for a string that matches an imported module seems to clobber that module's namespace. See below from my interactive shell:
#!python
>>> import datetime, fuzzy
>>> soundex = fuzzy.Soundex(4)
>>> datetime
<module 'datetime' from '/usr/lib/python2.6/lib-dynload/datetime.so'>
>>> soundex('datetime')
'D350'
>>> datetime
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'datetime' is not defined
Originally reported by: Brian (Bitbucket: eode, GitHub: eode)
#!python
import fuzzy
fdm = fuzzy.DMetaphone()
fdm10 = fuzzy.DMetaphone(10)
# note that this also trounces the 's' phoneme of 'decent'
>>> fdm('decent')
['TKNT', None]
>>> fdm('decentralization')
['TKNT', None]
>>> fdm10('decentralization')
['TKNT', None]
# ..for comparison:
import metaphone
mdm = metaphone.dm
>>> mdm('decent')
('TSNT', '')
>>> mdm('decentralization')
('TSNTRLSXN', '')
Expected behavior:
Originally reported by: Randy Ostler (Bitbucket: rando305, GitHub: rando305)
I love NYSIIS for preventing duplicates in my customer database, but I need to move to Python 3. Any chance of that happening soon?
pip3 install fuzzy
fails with error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
This is on Ubuntu Xenial with all build dependencies installed and up to date, using system Python 3.5.2.
The package installs fine with system Python 2.7.12.
In #12, I added a new test for DMetaphone, but that test reveals that the function is emitting bytestrings instead of text values. I suspect it would be better for DMetaphone to return text.
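Until the function returns text, a caller could decode the results itself. This is only a sketch: `codes_to_text` is a hypothetical helper, and it assumes the output is a list of ASCII bytestrings or None, as in the `['TKNT', None]` results quoted elsewhere on this page.

```python
def codes_to_text(codes, encoding="ascii"):
    """Hypothetical helper: decode bytestring codes, leave None and text alone."""
    return [c.decode(encoding) if isinstance(c, bytes) else c for c in codes]


# Example with a hand-written result in the shape DMetaphone is reported to return.
decoded = codes_to_text([b"TKNT", None])
```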
Originally reported by: Doug Hellmann (Bitbucket: dhellmann, GitHub: dhellmann)
The soundex implementation modifies the characters of the input Python string, changing the case of the letters. It doesn't look like any of the other algorithms have this problem.
For example, this Python code:
#!python
import fuzzy
names = [ 'Catherine', 'Katherine', 'Katarina',
'Johnathan', 'Jonathan', 'John',
]
for n in names:
print n, fuzzy.Soundex(4)(n), n
produces this output:
$ python show_soundex.py
Catherine C365 CATHERINe
Katherine K365 KATHERINe
Katarina K365 KATARINa
Johnathan J535 JOHNATHAN
Jonathan J535 JONATHAN
John J500 JOHN
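For comparison, here is a plain-Python sketch of American Soundex that never touches its input. It is not fuzzy's implementation, `soundex_reference` is a hypothetical name, and its codes may differ from fuzzy's on some inputs. It does agree with the codes listed above for these six names.

```python
# Map each consonant to its Soundex digit; vowels, H, W and Y have no digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex_reference(name, size=4):
    """Hypothetical reference: return the Soundex code, leaving name unchanged."""
    # upper() builds a new string, so the caller's string is never modified.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    # Pad with zeros, then cut to the requested length.
    return (result + "0" * size)[:size]


names = ["Catherine", "Katherine", "Katarina", "Johnathan", "Jonathan", "John"]
codes = [soundex_reference(n) for n in names]
```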
In this build, the tests are passing on Python 2.7 for bdb2890, but then one commit later, 6a9189a tagged for release as 1.2, the build fails. I re-ran the build today and it failed again the same way. As you can see from the diff, nothing significant changed. Yet the tests now fail. Why?
Originally reported by: Christopher Roudiez (Bitbucket: croudiez, GitHub: croudiez)
The key-value pair 'AY': 'Y' is included in the _nysiis_transforms dict. It should be in the _nysiis_suffix_map dict. This bug causes any string with a "Y" following a vowel to be miscoded. For example
FLOYD -> FLYD (should be FLAD)
SEYMOUR -> SYNAR (should be SANAR)
etc
Originally reported by: Anonymous
Running Soundex on a string changes the original string to uppercase.
That's all well and good, but interestingly, it also changes a deep copy of the original string! That seems pretty wrong...
#!python
>>> import copy, fuzzy
>>> x = "blabla"
>>> y = copy.deepcopy(x)
>>> sndex = fuzzy.Soundex(32)
>>> print sndex(x)
B4140000000000000000000000000000
>>> print x
BLABLA
>>> print y
BLABLA
>>> # running soundex on x changes deep copy y!
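A likely explanation for the deep copy changing too: Python treats strings as immutable, so `copy.deepcopy` returns the very same object rather than a duplicate. If an extension module rewrites the string's buffer in place, every name bound to that object sees the change. The sketch below shows the identity check without involving fuzzy; the variable names are illustrative only.

```python
import copy

x = "blabla"
y = copy.deepcopy(x)

# deepcopy hands immutable strings back unchanged, so x and y are one object.
same_object = y is x

# Rebuilding the string character by character, as one reporter does,
# yields an equal value built from scratch.
rebuilt = "".join(list(x))
```

In CPython the rebuilt string is normally a separate object, which would explain why the character-by-character workaround protects the original.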
Using the test case, in Python 3.5:
phrase = 'FancyFree'
print(repr(fuzzy.Soundex(4)(phrase)))
yields: ''
Occasionally, instead of yielding an empty string, it raises a Unicode error. dmeta and nysiis are working fine in this install, so I don't believe it was an install error.