
fuzzysearch's Introduction

Fuzzy Search

Inspired by bevacqua/fuzzysearch, a fuzzy matching library written in JavaScript, but with some extras such as ranking matches using Levenshtein distance and finding matches in a list of words.

Fuzzy searching allows for flexibly matching a string with partial input, useful for filtering data very quickly based on lightweight user input.

The current implementation uses the algorithm suggested by Mr. Aleph, a Russian compiler engineer working on V8.

Install

go get github.com/lithammer/fuzzysearch/fuzzy

Usage

package main

import "github.com/lithammer/fuzzysearch/fuzzy"

func main() {
	fuzzy.Match("twl", "cartwheel")  // true
	fuzzy.Match("cart", "cartwheel") // true
	fuzzy.Match("cw", "cartwheel")   // true
	fuzzy.Match("ee", "cartwheel")   // true
	fuzzy.Match("art", "cartwheel")  // true
	fuzzy.Match("eeel", "cartwheel") // false
	fuzzy.Match("dog", "cartwheel")  // false
	fuzzy.Match("kitten", "sitting") // false
	
	fuzzy.RankMatch("kitten", "sitting") // -1
	fuzzy.RankMatch("cart", "cartwheel") // 5
	
	words := []string{"cartwheel", "foobar", "wheel", "baz"}
	fuzzy.Find("whl", words) // [cartwheel wheel]
	
	fuzzy.RankFind("whl", words) // [{whl cartwheel 6 0} {whl wheel 2 2}]
	
	// Unicode normalized matching.
	fuzzy.MatchNormalized("cartwheel", "cartwhéél") // true

	// Case insensitive matching.
	fuzzy.MatchFold("ArTeeL", "cartwheel") // true
}

You can sort the result of a fuzzy.RankFind() call using the sort package in the standard library:

matches := fuzzy.RankFind("whl", words) // [{whl cartwheel 6 0} {whl wheel 2 2}]
sort.Sort(matches) // [{whl wheel 2 2} {whl cartwheel 6 0}]

See the fuzzy package documentation for more examples.

License

MIT

fuzzysearch's People

Contributors

dependabot[bot], evidlo, harrison3000, jeanmatheussouto, josharian, ktprograms, lithammer, mlwelles, nyaxt, rharriso, tang-hi, tokozedg


fuzzysearch's Issues

Expose Normalize function

Hello, I think exposing the Normalize and NormalizeFold functions might be a good idea.
When you want to reuse a []string of targets, you wouldn't need to normalize it again: normalize the targets once, then normalize only the source and call Find().

New release + upgrade to Go 1.18

Hi,

Thank you for maintaining this repo.

Was wondering if you could create a new release, please? The last release was in Oct 2020 and seems to use Go 1.15.

Thanks.

Add case insensitive RankFind

Hi, thanks for this library!

Currently, it seems that MatchFold can be used for case insensitive matching, but there doesn't seem to be any way to do case insensitive RankFind.

Using MaxInt instead of -1

First of all, what an awesome lib! Thanks for the work. But I do have a design question:

Let's say I want to compare strings two at a time against a target using RankMatchNormalizedFold, for example:

s1 := "Hello Wor"
s2 := "Hello"
target := "hello world"
fmt.Println(fuzzy.RankMatchNormalizedFold(s1, target)) //2
fmt.Println(fuzzy.RankMatchNormalizedFold(s2, target)) //6

I could easily take the minimum of (2, 6) -> "Hello Wor" to find the best string in this case. But it would get me in trouble if either string doesn't match, e.g.:

s1 := "foo"
s2 := "Hello"
target := "hello world"
fmt.Println(fuzzy.RankMatchNormalizedFold(s1, target)) //-1
fmt.Println(fuzzy.RankMatchNormalizedFold(s2, target)) //6

where the minimum of (-1, 6) would incorrectly pick "foo".

Why use -1 instead of math.MaxInt, given that the lower the distance, the closer the strings?

mishandling of utf8 replacement character

Add this test case to var fuzzyTests, and run the tests:

	{"\xffinvalid UTF-8\xff", "", false, -1},

Result:

--- FAIL: TestFuzzyMatchFold (0.00s)
panic: runtime error: slice bounds out of range [19:15] [recovered]
	panic: runtime error: slice bounds out of range [19:15]

goroutine 35 [running]:
testing.tRunner.func1.2({0x1031423e0, 0x14000114018})
	/Users/josh/go/1.20/src/testing/testing.go:1526 +0x1c8
testing.tRunner.func1()
	/Users/josh/go/1.20/src/testing/testing.go:1529 +0x384
panic({0x1031423e0, 0x14000114018})
	/Users/josh/go/1.20/src/runtime/panic.go:884 +0x204
golang.org/x/text/transform.String({0x103153b28, 0x103297c60}, {0x1030cb139, 0xf})
	/Users/josh/pkg/mod/golang.org/x/[email protected]/transform/transform.go:650 +0x9e4
github.com/lithammer/fuzzysearch/fuzzy.stringTransform({0x1030cb139, 0xf}, {0x103153b28?, 0x103297c60?})
	/Users/josh/x/fuzzysearch/fuzzy/fuzzy.go:242 +0x64
github.com/lithammer/fuzzysearch/fuzzy.match({0x1030cb139?, 0x7?}, {0x0, 0x0}, {0x103153b28, 0x103297c60})
	/Users/josh/x/fuzzysearch/fuzzy/fuzzy.go:55 +0x38
github.com/lithammer/fuzzysearch/fuzzy.MatchFold(...)
	/Users/josh/x/fuzzysearch/fuzzy/fuzzy.go:41
github.com/lithammer/fuzzysearch/fuzzy.TestFuzzyMatchFold(0x1400011cb60)
	/Users/josh/x/fuzzysearch/fuzzy/fuzzy_test.go:65 +0xbc
testing.tRunner(0x1400011cb60, 0x103152088)
	/Users/josh/go/1.20/src/testing/testing.go:1576 +0x10c
created by testing.(*T).Run
	/Users/josh/go/1.20/src/testing/testing.go:1629 +0x368
exit status 2
FAIL	github.com/lithammer/fuzzysearch/fuzzy	0.131s

This existed prior to #53 (phew!). The root cause is that unicodeFoldTransformer.Transform is returning n, n, err, but when utf8.RuneError is present, nSrc may differ from nDst. I'll try to put together a fix sometime soonish.

(Found by fuzzing. Once the fuzz tests make it out of the gate without stumbling, I'll PR them.)

Getting error:" github.com/lithammer/fuzzysearch: module github.com/lithammer/fuzzysearch@latest found (v1.1.2), but does not contain package github.com/lithammer/fuzzysearch"

When I run go mod tidy I get the following error:

go mod tidy
go: finding module for package github.com/lithammer/fuzzysearch
project/cmd imports
       github.com/lithammer/fuzzysearch: module github.com/lithammer/fuzzysearch@latest found (v1.1.2), but does not contain package github.com/lithammer/fuzzysearch

The import is not resolving correctly in my project. Any ideas?

panic in transform.go when it's trying to reslice

It seems it's trying to reslice with an invalid index in v1.1.5:

panic: runtime error: slice bounds out of range [5:3]

goroutine 399472 [running]:
golang.org/x/text/transform.String({0xe4c3d0, 0x13ce5c0}, {0xc00042d358, 0x3})
	/vendor/golang.org/x/text/transform/transform.go:650 +0xbe5
github.com/lithammer/fuzzysearch/fuzzy.stringTransform({0xc00042d358, 0x3}, {0xe4c3d0?, 0x13ce5c0?})
	/vendor/github.com/lithammer/fuzzysearch/fuzzy/fuzzy.go:243 +0x5a
github.com/lithammer/fuzzysearch/fuzzy.match({0xc00042d358?, 0x44fe72?}, {0xc0007a1950, 0x13}, {0xe4c3d0, 0x13ce5c0})

new type for targets

Hello everyone.
I want to ask why this library uses the string type for targets rather than a Stringer interface or similar. Developers more often need to search a slice of objects than an array of raw strings, so I think changing the target type would be useful.

Consider implementing support of multi-byte character sets

The current implementation assumes that strings are indexed by characters; this assumption only holds for the 1-byte UTF-8 subset (the Latin alphabet). Go strings are UTF-8 strings, so the proper way to iterate over characters is with range. Right now, for non-Latin strings the package computes byte distances, rendering it useless in a non-obvious way.

Exact match mode?

In fzf, for example, there's a so-called "exact search mode", triggered by ', which searches for the exact phrase provided. Would it be possible to have something like this in fuzzysearch?

As of now, searching for the string ae in archive, arkenfox, aerc, zambezi returns all four candidates. However, if an option for strict matching were added, the result would be only one: aerc.

Why is this needed? Because of exactly this: sometimes there are just too many candidates :)

What do you think?

RankMatch seems to be broken

README.md says:

fuzzy.RankMatch("kitten", "sitting") // 3

But RankMatch always returns -1.

package main

import (
	"fmt"

	"github.com/lithammer/fuzzysearch/fuzzy"
)

func main() {
	src := "kitten"
	target := "sitting"
	m := fuzzy.RankMatch(src, target)
	fmt.Println(m)
}

-1

https://play.golang.org/p/llP1davLYrK

Change LevenshteinDistance to Damerau–Levenshtein distance?

Hi,
The Damerau–Levenshtein distance is the minimum number of edit operations (insertions, deletions, substitutions, and transpositions) required to change one word into another. It adds one operation (transposition of adjacent characters) to the Levenshtein distance.
I think the Damerau–Levenshtein distance is more practical than the Levenshtein distance. Its time complexity is also O(N^2), but unfortunately its space complexity is O(N^2), while the Levenshtein distance's space complexity is O(N).

Do you think it's an acceptable change? If so, I would like to raise a PR to change LevenshteinDistance to Damerau–Levenshtein distance.

Unnecessary allocations due to transform.Nop

I noticed that when calling Match with no normalization or folding, the no-op transformer does an unnecessary string -> []byte -> string round trip. This results in two allocations; in my case these two allocations per operation end up taking 40% of the CPU time and adding a lot of GC pressure.

I made a patch that fixes the issue by avoiding transform.Nop (all tests pass, and the benchmarks confirm 0 allocs):
noallocnop.txt

Introduce gopkg support

Hi,

I'm planning on using your library, but to avoid future build issues I would like to pin versions. Would you mind creating branches/tags for your releases? The (simple) instructions are at: https://labix.org/gopkg.in

Cheers,

Mark
