# PySimStr

Fast(ish) string similarity for one vs many comparisons.

Solves the problem of fuzzy searching many different strings (unknown in advance), one at a time, against a relatively large, fixed collection of strings that is small enough to fit in memory many times over.

Example problem:

    import Levenshtein

    some_big_collection = ['Foo', 'bar', 'Something Else']  # ... and many more entries

    def compare_bruteforce(s_to_compare, some_big_collection, threshold):
        # Score the query against every element: one comparison per entry.
        for element in some_big_collection:
            score = Levenshtein.jaro_winkler(s_to_compare, element)
            if score >= threshold:
                return True  # stop at the first match at or above the threshold
        return False

As an example of real-world performance, this library speeds up lookup by a factor of roughly 10^4 when searching for a string in a collection of 10^5 entities, using a trigram index, a length tolerance of plus or minus 3 characters, and Jaro-Winkler as the comparison function.

Usage Example:

    >>> from pysimstr import SimStr
    >>> db = SimStr(idx_size=3, plus_minus=8, cutoff=0.85)
    >>> db.insert(('Harry Potter And The Big Wizard Guy',  # I have not read HP
                   'Game Of Thrones',
                   'Mad Max'))
    >>> db.check("Harry Potter and the Sorcerer's Stone")  # True
    >>> db.check("Harry Something")  # False
    >>> db.retrieve("Mad Monkey")  # ['Mad Max']
    >>> db.retrieve_with_score('Mad Monkey')  # [('Mad Max',
                                              #   0.8690476190476191)]

Speedup is achieved by an n-gram indexing strategy and only comparing strings of similar length.
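
The idea can be illustrated with a minimal sketch. This is not pysimstr's actual implementation: the class `NgramIndexSketch` and the helper `ngrams` are hypothetical names, and `difflib.SequenceMatcher` from the standard library stands in for Jaro-Winkler so the example runs without extra dependencies. Only strings that share at least one n-gram with the query and fall within the length tolerance are scored.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def ngrams(s, n):
    """Return the set of lowercase character n-grams of s."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


class NgramIndexSketch:
    """Hypothetical illustration only; not pysimstr's real code."""

    def __init__(self, idx_size=3, plus_minus=3, cutoff=0.85):
        self.idx_size = idx_size
        self.plus_minus = plus_minus
        self.cutoff = cutoff
        self.index = defaultdict(set)  # n-gram -> strings containing it

    def insert(self, strings):
        for s in strings:
            for gram in ngrams(s, self.idx_size):
                self.index[gram].add(s)

    def candidates(self, query):
        """Strings that share an n-gram with query and have similar length."""
        found = set()
        for gram in ngrams(query, self.idx_size):
            for s in self.index.get(gram, ()):
                if abs(len(s) - len(query)) <= self.plus_minus:
                    found.add(s)
        return found

    def check(self, query):
        # Assumption: difflib's ratio replaces Jaro-Winkler as the scorer.
        return any(
            SequenceMatcher(None, query.lower(), s.lower()).ratio() >= self.cutoff
            for s in self.candidates(query)
        )
```

The expensive scoring function runs only on the candidate set, which is usually a small fraction of the collection. The trade-off is that a query sharing no n-gram with a stored string is never compared against it, even if the scorer would have rated the pair highly.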

Most useful with Levenshtein-like distance functions that take a while to compute.

#### Note

Comparisons can take much longer if the incoming strings are long relative to the index size. If your incoming dataset contains strings of widely varying length, such as strings of only 3 characters alongside strings of 20 characters, create several instances of the SimStr class with different index sizes.

Compare the longer strings against the instances with larger index sizes.
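
One way to route queries is sketched below. The helper `pick_index` is hypothetical and not part of pysimstr, and the length threshold and parameter values in the commented usage are arbitrary assumptions, not recommendations.

```python
def pick_index(query, routes):
    """Return the index for the first route whose max length covers the query.

    routes: list of (max_query_len, index) pairs sorted by max_query_len.
    The last route doubles as the fallback for longer queries.
    """
    for max_len, index in routes:
        if len(query) <= max_len:
            return index
    return routes[-1][1]


# Hypothetical usage with pysimstr (values chosen only for illustration):
#
#   from pysimstr import SimStr
#   short_db = SimStr(idx_size=2, plus_minus=2, cutoff=0.85)
#   long_db = SimStr(idx_size=4, plus_minus=8, cutoff=0.85)
#   for db in (short_db, long_db):
#       db.insert(some_big_collection)
#   routes = [(8, short_db), (float('inf'), long_db)]
#   pick_index("Mad Monkey", routes).check("Mad Monkey")
```

Each instance holds its own copy of the index, so memory use grows with the number of instances.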
