stringosim

The plan for this package is to provide Go implementations of different string distance/similarity functions, such as Levenshtein (normalized, weighted, Damerau), Jaro-Winkler, Jaccard index, Euclidean distance, Hamming distance...

The following are currently implemented:

  • Levenshtein
  • Jaccard
  • Hamming
  • LCS
  • Jaro and Jaro-Winkler
  • Q-gram
  • n-gram based Cosine distance

Work in progress...

Import and installation

To get the library just run:

    go get github.com/dexyk/stringosim

To use the library just import it in your code:

    import "github.com/dexyk/stringosim"

To run the tests, go to the directory where the stringosim package is installed and run:

    go test

Usage

The sections below cover the Levenshtein, Jaccard, Hamming, LCS, Jaro, Jaro-Winkler, Q-gram and Cosine functions.

Levenshtein

Levenshtein distance can be calculated with default parameters (use DefaultSimilarityOptions), where the cost of each insert, delete and substitute operation is 1. To use other costs, pass an options value (LevenshteinSimilarityOptions in the example). When CaseInsensitive is set to true in the options, the comparison ignores character case.

Example:

    fmt.Println(stringosim.Levenshtein([]rune("stringosim"), []rune("stingobim")))

    fmt.Println(stringosim.Levenshtein([]rune("stringosim"), []rune("stingobim"),
    stringosim.LevenshteinSimilarityOptions{
        InsertCost:     3,
        DeleteCost:     5,
        SubstituteCost: 2,
    }))

    fmt.Println(stringosim.Levenshtein([]rune("stringosim"), []rune("STRINGOSIM"),
    stringosim.LevenshteinSimilarityOptions{
        InsertCost:      3,
        DeleteCost:      4,
        SubstituteCost:  5,
        CaseInsensitive: true,
    }))
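
To make the role of the three costs concrete, here is a minimal, self-contained sketch of a weighted Levenshtein distance. It is not stringosim's implementation: the editCosts type, its field names and the weightedLevenshtein function are hypothetical names chosen for this illustration, and the sketch assumes that "delete" means removing a rune from the first argument and "insert" means adding a rune of the second.

```go
package main

import (
	"fmt"
	"unicode"
)

// editCosts is a hypothetical options type used only in this sketch.
type editCosts struct {
	Insert, Delete, Substitute int
	CaseInsensitive            bool
}

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// weightedLevenshtein returns the minimum total cost of turning s into t.
func weightedLevenshtein(s, t []rune, c editCosts) int {
	same := func(a, b rune) bool {
		if c.CaseInsensitive {
			return unicode.ToLower(a) == unicode.ToLower(b)
		}
		return a == b
	}
	// prev[j] is the cost of turning the processed prefix of s into t[:j].
	prev := make([]int, len(t)+1)
	for j := range prev {
		prev[j] = j * c.Insert
	}
	for i := 1; i <= len(s); i++ {
		cur := make([]int, len(t)+1)
		cur[0] = i * c.Delete
		for j := 1; j <= len(t); j++ {
			sub := prev[j-1]
			if !same(s[i-1], t[j-1]) {
				sub += c.Substitute
			}
			cur[j] = min3(sub, prev[j]+c.Delete, cur[j-1]+c.Insert)
		}
		prev = cur
	}
	return prev[len(t)]
}

func main() {
	a, b := []rune("stringosim"), []rune("stingobim")
	fmt.Println(weightedLevenshtein(a, b, editCosts{Insert: 1, Delete: 1, Substitute: 1})) // 2
	fmt.Println(weightedLevenshtein(a, b, editCosts{Insert: 3, Delete: 5, Substitute: 2})) // 7
}
```

With unit costs the two strings differ by one deletion ("r") and one substitution ("s" to "b"), so the sketch prints 2; with the custom costs the same edit script costs 5 + 2 = 7.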

Jaccard

Jaccard distance is calculated over n-grams of a given size. If the size is omitted, the default value of 1 is used.

Example:

    fmt.Println(stringosim.Jaccard([]rune("stringosim"), []rune("stingobim")))

    fmt.Println(stringosim.Jaccard([]rune("stringosim"), []rune("stingobim"), []int{2}))

    fmt.Println(stringosim.Jaccard([]rune("stringosim"), []rune("stingobim"), []int{3}))
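
As an illustration of how the n-gram size enters the calculation, the following sketch computes a Jaccard distance over sets of distinct n-grams, 1 - |A∩B| / |A∪B|. The function names ngramSet and jaccardDistance are hypothetical, and treating n-grams as a set (rather than a multiset) is an assumption of this sketch; stringosim's own implementation may differ.

```go
package main

import "fmt"

// ngramSet collects the distinct n-grams of s, keyed by their string form.
func ngramSet(s []rune, n int) map[string]bool {
	set := make(map[string]bool)
	for i := 0; i+n <= len(s); i++ {
		set[string(s[i:i+n])] = true
	}
	return set
}

// jaccardDistance returns 1 - |A∩B| / |A∪B| over the n-gram sets of a and b.
func jaccardDistance(a, b []rune, n int) float64 {
	sa, sb := ngramSet(a, n), ngramSet(b, n)
	if len(sa) == 0 && len(sb) == 0 {
		return 0 // two empty sets are treated as identical in this sketch
	}
	inter := 0
	for g := range sa {
		if sb[g] {
			inter++
		}
	}
	union := len(sa) + len(sb) - inter
	return 1 - float64(inter)/float64(union)
}

func main() {
	a, b := []rune("stringosim"), []rune("stingobim")
	for _, n := range []int{1, 2, 3} {
		fmt.Printf("n=%d distance=%.4f\n", n, jaccardDistance(a, b, n))
	}
}
```

For n = 1 the two strings share 7 of 9 distinct runes, so the sketch yields 1 - 7/9. Larger n-grams are more sensitive to local differences and give a larger distance for the same pair.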

Hamming

Hamming distance can be calculated with options. By default the function calculates the standard Hamming distance with a case-sensitive comparison. A case-insensitive comparison can be requested through the options.

If the strings to compare have different lengths, an error is returned.

Example:

    dis, _ := stringosim.Hamming([]rune("testing"), []rune("restink"))
    fmt.Println(dis)

    dis, _ = stringosim.Hamming([]rune("testing"), []rune("FESTING"), stringosim.HammingSimilarityOptions{
        CaseInsensitive: true,
    })
    fmt.Println(dis)

    _, err := stringosim.Hamming([]rune("testing"), []rune("testin"))
    fmt.Println(err)

Longest Common Subsequence (LCS)

The LCS of two strings can be calculated with options. By default the comparison is case sensitive. A case-insensitive comparison can be requested through the options.

Example:

    fmt.Println(stringosim.LCS([]rune("testing lcs algorithm"), []rune("another l c s example")))

    fmt.Println(stringosim.LCS([]rune("testing lcs algorithm"), []rune("ANOTHER L C S EXAMPLE"),
    stringosim.LCSSimilarityOptions{
        CaseInsensitive: true,
    }))
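
For readers unfamiliar with the technique, here is a minimal dynamic-programming sketch that computes the length of the longest common subsequence. The lcsLength name is hypothetical, and returning only the length is a choice made for this sketch; it says nothing about what stringosim.LCS returns.

```go
package main

import "fmt"

// lcsLength returns the length of the longest common subsequence of a and b.
// It keeps only two rows of the usual (len(a)+1) x (len(b)+1) table.
func lcsLength(a, b []rune) int {
	prev := make([]int, len(b)+1)
	for i := 1; i <= len(a); i++ {
		cur := make([]int, len(b)+1)
		for j := 1; j <= len(b); j++ {
			switch {
			case a[i-1] == b[j-1]:
				cur[j] = prev[j-1] + 1 // extend the common subsequence
			case prev[j] >= cur[j-1]:
				cur[j] = prev[j] // skip a[i-1]
			default:
				cur[j] = cur[j-1] // skip b[j-1]
			}
		}
		prev = cur
	}
	return prev[len(b)]
}

func main() {
	fmt.Println(lcsLength([]rune("ABCBDAB"), []rune("BDCABA"))) // 4
	fmt.Println(lcsLength([]rune("testing lcs algorithm"), []rune("another l c s example")))
}
```

A case-insensitive variant can be obtained by lowercasing both inputs before calling the function.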

Jaro and Jaro-Winkler

Jaro and Jaro-Winkler can be calculated with options: a case-insensitive comparison and, for Jaro-Winkler, a threshold, a p value and an l value.

Example:

    fmt.Println(stringosim.Jaro([]rune("abaccbabaacbcb"), []rune("bababbcabbaaca")))
    fmt.Println(stringosim.Jaro([]rune("abaccbabaacbcb"), []rune("ABABAbbCABbaACA"),
    stringosim.JaroSimilarityOptions{
        CaseInsensitive: true,
    }))

    fmt.Println(stringosim.JaroWinkler([]rune("abaccbabaacbcb"), []rune("bababbcabbaaca")))
    fmt.Println(stringosim.JaroWinkler([]rune("abaccbabaacbcb"), []rune("BABAbbCABbaACA"),
    stringosim.JaroSimilarityOptions{
        CaseInsensitive: true,
        Threshold:       0.7,
        PValue:          0.1,
        LValue:          4,
    }))
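
The sketch below shows the textbook Jaro similarity and the Winkler prefix adjustment, to illustrate what a threshold, a p value and an l value usually mean. It is written from the standard definitions, not from stringosim's source: the jaro and jaroWinkler functions are hypothetical, and the mapping of Threshold, PValue and LValue to the threshold, p and maxPrefix parameters is an assumption.

```go
package main

import "fmt"

// jaro returns the Jaro similarity of a and b in the range [0, 1].
func jaro(a, b []rune) float64 {
	if len(a) == 0 && len(b) == 0 {
		return 1
	}
	if len(a) == 0 || len(b) == 0 {
		return 0
	}
	longest := len(a)
	if len(b) > longest {
		longest = len(b)
	}
	window := longest/2 - 1 // runes match only within this distance
	if window < 0 {
		window = 0
	}
	matchedA := make([]bool, len(a))
	matchedB := make([]bool, len(b))
	matches := 0
	for i := range a {
		lo, hi := i-window, i+window+1
		if lo < 0 {
			lo = 0
		}
		if hi > len(b) {
			hi = len(b)
		}
		for j := lo; j < hi; j++ {
			if !matchedB[j] && a[i] == b[j] {
				matchedA[i], matchedB[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}
	// Count matched runes that appear in a different order in a and b.
	outOfOrder, k := 0, 0
	for i := range a {
		if !matchedA[i] {
			continue
		}
		for !matchedB[k] {
			k++
		}
		if a[i] != b[k] {
			outOfOrder++
		}
		k++
	}
	m := float64(matches)
	t := float64(outOfOrder / 2) // transpositions
	return (m/float64(len(a)) + m/float64(len(b)) + (m-t)/m) / 3
}

// jaroWinkler boosts the Jaro score by the length of the common prefix
// (capped at maxPrefix) when the score exceeds threshold.
func jaroWinkler(a, b []rune, threshold, p float64, maxPrefix int) float64 {
	j := jaro(a, b)
	if j <= threshold {
		return j
	}
	l := 0
	for l < len(a) && l < len(b) && l < maxPrefix && a[l] == b[l] {
		l++
	}
	return j + float64(l)*p*(1-j)
}

func main() {
	a, b := []rune("MARTHA"), []rune("MARHTA")
	fmt.Printf("%.4f\n", jaro(a, b))                   // 0.9444
	fmt.Printf("%.4f\n", jaroWinkler(a, b, 0.7, 0.1, 4)) // 0.9611
}
```

Both functions here return a similarity (1 means identical), so higher values indicate closer strings.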

Q-gram

Q-gram distance can be calculated using default options (DefaultQGramOptions): the q-gram length is 2 and the comparison is case sensitive. Passing QGramSimilarityOptions to the function sets a custom q-gram length and controls whether the comparison is case sensitive.

Example:

    fmt.Println(stringosim.QGram([]rune("abcde"), []rune("abdcde")))

    fmt.Println(stringosim.QGram([]rune("abcde"), []rune("ABDCDE"),
    stringosim.QGramSimilarityOptions{
        CaseInsensitive: true,
        NGramSizes:      []int{3},
    }))
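
As a rough illustration of the idea, the sketch below counts the q-grams of each string and sums the absolute differences of the counts, which is the usual definition of q-gram distance. The qgramCounts and qgramDistance names are hypothetical, the sketch assumes q is at least 1, and stringosim's implementation may differ in detail.

```go
package main

import "fmt"

// qgramCounts returns how often each q-gram occurs in s.
func qgramCounts(s []rune, q int) map[string]int {
	counts := make(map[string]int)
	for i := 0; i+q <= len(s); i++ {
		counts[string(s[i:i+q])]++
	}
	return counts
}

// qgramDistance sums |count_a(g) - count_b(g)| over all q-grams g.
func qgramDistance(a, b []rune, q int) int {
	ca, cb := qgramCounts(a, q), qgramCounts(b, q)
	d := 0
	for g, n := range ca {
		diff := n - cb[g] // cb[g] is 0 when g does not occur in b
		if diff < 0 {
			diff = -diff
		}
		d += diff
	}
	for g, n := range cb {
		if _, seen := ca[g]; !seen {
			d += n // q-grams that occur only in b
		}
	}
	return d
}

func main() {
	fmt.Println(qgramDistance([]rune("abcde"), []rune("abdcde"), 2)) // 3
	fmt.Println(qgramDistance([]rune("abcde"), []rune("abdcde"), 3)) // 5
}
```

For q = 2 the strings differ in the 2-grams "bc", "bd" and "dc", each occurring once in exactly one of the strings, which gives a distance of 3.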

Cosine

Cosine distance can be calculated using default options (DefaultCosineOptions): the n-gram length is 2 and the comparison is case sensitive. Passing CosineSimilarityOptions to the function sets a custom n-gram length and controls whether the comparison is case sensitive.

Example:

    fmt.Println(stringosim.Cosine([]rune("abcde"), []rune("abdcde")))

    fmt.Println(stringosim.Cosine([]rune("abcde"), []rune("ABDCDE"),
    stringosim.CosineSimilarityOptions{
        CaseInsensitive: true,
        NGramSizes:      []int{3},
    }))
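
To show what an n-gram based cosine distance computes, the following sketch builds an n-gram count vector for each string and returns 1 minus the cosine of the angle between the vectors. The ngramCounts and cosineDistance names are hypothetical, and returning 1 when either string has no n-grams is a choice made for this sketch rather than documented library behaviour.

```go
package main

import (
	"fmt"
	"math"
)

// ngramCounts returns how often each n-gram occurs in s.
func ngramCounts(s []rune, n int) map[string]int {
	counts := make(map[string]int)
	for i := 0; i+n <= len(s); i++ {
		counts[string(s[i:i+n])]++
	}
	return counts
}

// cosineDistance returns 1 - (A·B) / (|A| |B|) over n-gram count vectors.
func cosineDistance(a, b []rune, n int) float64 {
	ca, cb := ngramCounts(a, n), ngramCounts(b, n)
	var dot, normA, normB float64
	for g, x := range ca {
		dot += float64(x * cb[g]) // cb[g] is 0 for n-grams missing from b
		normA += float64(x * x)
	}
	for _, y := range cb {
		normB += float64(y * y)
	}
	if normA == 0 || normB == 0 {
		return 1 // no n-grams to compare; treated as maximally distant here
	}
	return 1 - dot/(math.Sqrt(normA)*math.Sqrt(normB))
}

func main() {
	fmt.Printf("%.4f\n", cosineDistance([]rune("abcde"), []rune("abdcde"), 2))
	fmt.Printf("%.4f\n", cosineDistance([]rune("abcde"), []rune("abdcde"), 3))
}
```

For n = 2 the strings share three 2-grams ("ab", "cd", "de") out of four and five respectively, so the cosine is 3 / (2 · √5) and the distance is roughly 0.33.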

Contributors

dexyk, rsdoiel
