
StringDistances.jl's Introduction


Installation

The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.

Supported Distances

String distances act over any pair of iterators that define length (e.g. AbstractStrings, GraphemeIterators, or AbstractVectors)

The available distances include edit distances (Levenshtein, OptimalStringAlignment, DamerauLevenshtein, Jaro, RatcliffObershelp) and q-gram distances (QGram, Cosine, Jaccard, Overlap, SorensenDice, MorisitaOverlap, NMD).

Syntax

Following the Distances.jl package, string distances inherit from one of two abstract types: StringSemiMetric <: SemiMetric or StringMetric <: Metric.

Computing the distance between two strings (or iterators)

You can compute a given distance between two strings using either of the following syntaxes:

r = evaluate(dist, x, y)
r = dist(x, y)

Here, dist is an instance of a distance type: for example, the type for the Levenshtein distance is Levenshtein. You can compute the Levenshtein distance between x and y as

r = evaluate(Levenshtein(), x, y)
r = Levenshtein()(x, y)

The function compare returns the similarity score, defined as 1 minus the normalized distance between two strings. It always returns an element of type Float64. A value of 0.0 means completely different and a value of 1.0 means completely similar.

Levenshtein()("martha", "martha")
#> 0
compare("martha", "martha", Levenshtein())
#> 1.0

Computing the distance between two AbstractVectors of strings (or iterators)

Consider X and Y, two AbstractVectors of iterators. You can compute the matrix of distances across their elements, dist(X[i], Y[j]), as follows:

pairwise(dist, X, Y)

For instance,

pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])

pairwise is optimized in various ways (e.g., for q-gram distances, dictionaries of q-grams are pre-computed).

Find closest string

The package also adds convenience functions to find the elements in an iterator of strings closest to a given string:

  • findnearest returns the value and index of the element in itr with the highest similarity score with s. Its syntax is:

     findnearest(s, itr, dist)
  • findall returns the indices of all elements in itr with a similarity score with s higher than a minimum score. Its syntax is:

     findall(s, itr, dist; min_score = 0.8)

The functions findnearest and findall are particularly optimized for the Levenshtein and OptimalStringAlignment distances, as these algorithms can stop early once the distance exceeds a certain threshold.
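As a self-contained illustration of what findnearest computes, here is a brute-force sketch with a minimal Levenshtein implementation (the package's real functions are optimized, e.g. with early stopping; `lev` and `nearest` are hypothetical names used only for this example):

```julia
# Minimal Levenshtein distance (two-row dynamic programming).
function lev(a::AbstractString, b::AbstractString)
    lb = length(b)
    prev = collect(0:lb)
    curr = similar(prev)
    for (i, ca) in enumerate(a)
        curr[1] = i
        for (j, cb) in enumerate(b)
            cost = ca == cb ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1, curr[j] + 1, prev[j] + cost)
        end
        prev, curr = curr, prev
    end
    return prev[lb + 1]
end

# Brute-force analogue of findnearest: similarity = 1 - distance / max length.
function nearest(s, itr)
    score(t) = 1 - lev(s, t) / max(length(s), length(t), 1)
    i = argmax(map(score, itr))
    return itr[i], i
end

nearest("martha", ["kitten", "marhta", "sitting"])  # → ("marhta", 2)
```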

fuzzywuzzy

The package also defines distance "modifiers" inspired by the Python package fuzzywuzzy. These modifiers are particularly helpful for matching strings composed of multiple words (e.g. addresses, company names).

  • Partial returns the minimum of the distance between the shorter string and substrings of the longer string.
  • TokenSort adjusts for differences in word orders by returning the distance of the two strings, after re-ordering words alphabetically.
  • TokenSet adjusts for differences in word orders and word numbers by returning the distance between the intersection of two strings with each string.
  • TokenMax normalizes the distance and combines the Partial, TokenSort, and TokenSet modifiers, with penalty terms depending on string lengths. TokenMax(Levenshtein()) corresponds to the distance defined in fuzzywuzzy.
Levenshtein()("this string", "this string is longer") = 10
Partial(Levenshtein())("this string", "this string is longer") = 0
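The word re-ordering behind TokenSort can be sketched in plain Julia (a simplified illustration, not the package's implementation; `tokensort` is a hypothetical name):

```julia
# TokenSort idea: sort whitespace-separated words alphabetically before
# comparing, so word order no longer affects the distance.
tokensort(s::AbstractString) = join(sort(split(s)), " ")

tokensort("new york mets vs atlanta braves")  # → "atlanta braves mets new vs york"
# TokenSort(dist)(s1, s2) then amounts to dist(tokensort(s1), tokensort(s2)).
```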

Notes

  • All string distances are case sensitive.

StringDistances.jl's People

Contributors

adienes, ararslan, baggepinnen, dillondaudert, fchichorro, github-actions[bot], iainnz, juliatagbot, matthieugomez, pfitzseb, robertfeldt


StringDistances.jl's Issues

The value of "compare" is probably wrong.

I understand that to normalize the Levenshtein distance, the value is divided by the length of the longer string.
However, the results I have calculated using that method are different from the results using the compare function.

julia> using StringDistances
julia> s1 = "martha"
julia> s2 = "marhtai"
julia> Levenshtein()(s1, s2)
3
julia> max(length(s1),length(s2))
7
julia> 1 - Levenshtein()(s1, s2) / 7
0.5714285714285714
julia> compare(s1, s2, Levenshtein())
0.8571428571428572

Why did this result occur?
I would appreciate it if you could tell me about it.

New Release

Hi, would you mind tagging a new release -- even though not that much has happened since 0.3.0 it'd be nice to have the couple of bugfixes :)

unexpected behavior when computing distance with an array

Computing distances with one of the inputs as an array (of anything) does not result in an error, but instead gives a nonsensical result.

For example,

using StringDistances
d = Levenshtein()
d("minimal working example", ["an array", " of strings"])

returns 23, which is the length of the first argument, and not the distance between the first argument and any of the entries in the second argument.

The result is identical (the length of the first argument, at least for Levenshtein) if the array is empty or contains non-string entries.

Unicode error when using TokenMax() and Partial()

Encountered an error when trying to compare a bunch of strings in Chinese. This
compare(TokenMax(RatcliffObershelp()), "为人子女者要堂堂正正做人,千万不可作奸犯科,致使父母蒙羞", "此前稍早些时候**商务部发布消息称,中美经贸高级别磋商双方牵头人通话,中方就美拟9月1日加征关税进行了严正交涉。")
or this
compare(Partial(RatcliffObershelp()), "为人子女者要堂堂正正做人,千万不可作奸犯科,致使父母蒙羞", "此前稍早些时候**商务部发布消息称,中美经贸高级别磋商双方牵头人通话,中方就美拟9月1日加征关税进行了严正交涉。")
throws a StringIndexError. At the same time, TokenSort, TokenSet, or just plain RatcliffObershelp work just fine. Some relic from the times when StringDistances worked only with ASCII strings, perhaps?

Non-strings

I see these functions ask for AbstractStrings. I'm curious if that is a necessary restriction, or could they also apply to AbstractVectors and other types?

Mutating version of evaluate

When I was working with the DamerauLevenshtein evaluate function, I noticed that it allocates a lot. The problem is with these lines: https://github.com/matthieugomez/StringDistances.jl/blob/master/src/edit.jl#L146-L147

It doesn't look like much when you compare only two words, but in my particular case I needed to compare thousands of words in a single run, and these allocations became a real problem. I rewrote this algorithm by adding two mutable vectors that are reused between calculations and saw an almost 80% drop in allocations and a 10%-20% increase in speed. If there is interest, I can make a PR adding a mutating version of this and the Levenshtein evaluate function.
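The buffer-reuse idea can be sketched as follows (a minimal illustration with hypothetical names, not the proposed PR): allocate the two DP rows once and pass them into every call.

```julia
# Levenshtein with caller-provided work vectors: no per-call allocation.
function lev!(prev::Vector{Int}, curr::Vector{Int}, a, b)
    lb = length(b)
    @assert length(prev) >= lb + 1 && length(curr) >= lb + 1
    for j in 0:lb
        prev[j + 1] = j
    end
    for (i, ca) in enumerate(a)
        curr[1] = i
        for (j, cb) in enumerate(b)
            cost = ca == cb ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1, curr[j] + 1, prev[j] + cost)
        end
        prev, curr = curr, prev  # swap local bindings; buffers are reused
    end
    return prev[lb + 1]
end

# Allocate once, reuse across thousands of comparisons:
prev, curr = zeros(Int, 64), zeros(Int, 64)
lev!(prev, curr, "kitten", "sitting")  # → 3
```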

Simpler QGramDistances implementation and prep for general dictionaries and iterators

Matthieu, I checked and you are right that there is little to no performance benefit in splitting the counting for QGramDistances into countleft, countright, and countboth. I thus made a branch which simplifies the implementation by only having a single count method.

I also split the helpers code for QGramDistances into separate files in a helpers subdir and generalized to prepare for other dictionaries (when preprocessing) as well as iterators that have varying length words (q-gram iterators are fixed length).

This new design will be more maintainable going forward and keeps the good performance. For some distances performance even improves since we need not re-calculate information we already have after preprocessing.

Even though there are no new features added there are quite a few changes so before I potentially do a PR feel free to compare to your current master here:

master...robertfeldt:simpler_counter_design

TypeError: collect: in typeassert, expected SubString{SubString{String}}, got SubString{String}

I get a type error when combining TokenMax with qgram measures in some cases:
julia> StringDistances.compare(TokenMax(Jaccard(2)), "aa", "aa ")

ERROR: TypeError: collect: in typeassert, expected SubString{SubString{String}}, got SubString{String}
Stacktrace:
[1] collect(::StringDistances.QGramIterator{SubString{String},Int64}) at /Users/alanschelten/.julia/v0.6/StringDistances/src/distances/qgram.jl:33
[2] sort at /Users/alanschelten/.julia/v0.6/StringDistances/src/distances/qgram.jl:37 [inlined]
[3] evaluate(::StringDistances.Jaccard{Int64}, ::String, ::SubString{String}) at /Users/alanschelten/.julia/v0.6/StringDistances/src/distances/qgram.jl:85
[4] compare(::StringDistances.Jaccard{Int64}, ::String, ::SubString{String}) at /Users/alanschelten/.julia/v0.6/StringDistances/src/compare.jl:23
[5] compare(::StringDistances.Partial{StringDistances.Jaccard{Int64}}, ::String, ::String) at /Users/alanschelten/.julia/v0.6/StringDistances/src/modifiers/fuzzywuzzy.jl:19
[6] compare(::StringDistances.TokenMax{StringDistances.Jaccard{Int64}}, ::String, ::String) at /Users/alanschelten/.julia/v0.6/StringDistances/src/modifiers/fuzzywuzzy.jl:138

typos in variable names

In lines 150-151 of fuzzywuzzy.jl, tsor and tser should be ptsor and ptser, respectively.

Caching iterators and/or dictionary-based distances?

Thanks for StringDistances.jl. For my research, I have developed a similar package and only just found yours. There is one difference in our designs in that I "cache" the Qgrams in a Counter dict so that I can later, more quickly, compare the similarity of new strings with sets of old strings. This is useful when searching/optimising for sets of similar or dissimilar strings/objects. Would you be interested in exploring a way to support this type of design also in your package?

NaN (or ArgumentError) from QGram distances for short strings

Some QGram distances today return NaN if one of the input strings is shorter than the q-gram length (q), while others don't:

julia> using StringDistances

julia> isnan(Cosine(2)("", "bb"))
true

julia> isnan(Cosine(2)("a", "bb"))
true

julia> Jaccard(2)("a", "bb")
1.0

julia> filter(d -> isnan(d(2)("", "bb")), [QGram, Cosine, Jaccard, Overlap, SorensenDice, MorisitaOverlap, NMD])
3-element Vector{DataType}:
 Cosine
 Overlap
 MorisitaOverlap

Maybe it is better to have consistent behaviour for such inputs? Throwing an ArgumentError might be better; then the caller has to decide how to handle such situations.

Speeding up qgram distances with pre-counting of qgrams

Thanks again, for a great package.

I repeatedly need to calculate distances to some set of strings, so I tried to speed things up by pre-counting q-grams. This can be very useful when calculating, for example, distance matrices.

I can get about 2.5-3x speedups after pre-counting. Of course, it is slower (double the time on my machine) if you only want to calculate distances once.

If there is any interest in merging this I can try doing a PR at some point; if not, here is the code in case someone else has a similar need:

https://gist.github.com/robertfeldt/103f078b3154c5621f52cee3d061bf81

I'll try to also use a pre-sorted array of counts rather than a dict and see if I can push this further.
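For illustration, the pre-counting idea can be sketched in plain Julia (the names `qgramcounts` and `jaccard` here are hypothetical, not the gist's API):

```julia
# Count the q-grams of a string once, so repeated comparisons reuse the dict.
function qgramcounts(s::AbstractString, q::Int)
    counts = Dict{String, Int}()
    cs = collect(s)  # index by character, safe for Unicode strings
    for i in 1:(length(cs) - q + 1)
        g = String(cs[i:i+q-1])
        counts[g] = get(counts, g, 0) + 1
    end
    return counts
end

# Jaccard distance computed on pre-counted q-gram sets.
function jaccard(c1::Dict, c2::Dict)
    ndistinct = length(union(keys(c1), keys(c2)))
    ncommon = count(k -> haskey(c2, k), keys(c1))
    return 1 - ncommon / ndistinct
end

c1 = qgramcounts("martha", 2)  # computed once...
c2 = qgramcounts("marhta", 2)
jaccard(c1, c2)                # ...then reused for every comparison
```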

bug in `DamerauLevenshtein`

using StringDistances
a = "qxKH"
b = collect(a); b[2]=rand('a':'z'); b = join(b);
DamerauLevenshtein()(a,b)
#> 1
DamerauLevenshtein()(a,b, 1.1)
#> 2.1
DamerauLevenshtein()(a,b, 2)
#> 1

Here, DL(a, b, max_dist) should be 1 for any max_dist > 1, but it seems that for 1 < max_dist < 2 we get the wrong answer.

`Partial` only looks at substrings of the same length...

...but sometimes shorter substrings can give better matches:

julia> using StringDistances #v0.10

julia> a = "giant golden crowned flying fox"
"giant golden crowned flying fox"

julia> b = "A giantgolden crownedflying fox slept all day."
"A giantgolden crownedflying fox slept all day."

julia> Partial(DamerauLevenshtein())(a,b)
4

julia> DamerauLevenshtein()(a,b[3:31])
2

julia> b[3:31]
"giantgolden crownedflying fox"

Not really sure if this counts as a bug, but I think it should at least be documented. If there were a performant way to match shorter subsequences too, I think that would be ideal.

Levenshtein Distance is a true Metric, not a SemiMetric

It satisfies the triangle inequality.
Here's a nice blog post on that fact.

But in the package it subtypes StringDistance, which subtypes SemiMetric.

Getting this wrong is a problem because various kinds of neighbour trees (NearestNeighbors.jl) require a true Metric to work.

I suggest that we @deprecate_binding StringDistance SemiMetric
and just use the right type from Distances.jl for each thing directly.
The fact that a particular metric is a StringDistance does not particularly matter.

Safe version of evaluate and compare for strings shorter than Q?

When processing distances between pairs of a large number of strings, I sometimes run into problems since evaluate(Jaccard(2), "a", "b") returns NaN. Is this the "right" return value, or should there be a "safe" version of evaluate that returns the max distance (1.0) if the strings are not equal while returning the min distance (0.0) if they are the same?
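One possible "safe" wrapper, sketched against any normalized-distance callable (`safe_evaluate` and `toydist` are hypothetical names, not part of the package):

```julia
# Wrap any normalized distance: identical strings get 0.0, NaN becomes
# the maximum distance 1.0, everything else passes through.
function safe_evaluate(dist, s1, s2)
    s1 == s2 && return 0.0
    v = dist(s1, s2)
    return isnan(v) ? 1.0 : Float64(v)
end

# Works with any callable, e.g. a stand-in that returns NaN on short input:
toydist(a, b) = (isempty(a) || isempty(b)) ? NaN : 0.5
safe_evaluate(toydist, "a", "a")  # → 0.0
safe_evaluate(toydist, "", "b")   # → 1.0
safe_evaluate(toydist, "a", "b")  # → 0.5
```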

Phonetic distance

In real-world cases it's common to find string mismatches due to words that sound the same but are written differently, like meet-meat, one-won, for-four, ate-eight, and many others.

`compare` with `Partial` distances gives negative answers

julia> compare("ab", "de", Partial(DamerauLevenshtein()))
-1

julia> compare("abc", "de", Partial(DamerauLevenshtein()))
-1

This causes findmax to not work. I think this is due to the normalization changes. This also seems odd:

julia> normalize(Partial(DamerauLevenshtein()))("ab", "cde")
2

julia> normalize(Partial(DamerauLevenshtein()))("ab", "cdef")
2

Problem with Unicode

The indexing into Unicode strings is faulty:

julia> compare(TokenMax(RatcliffObershelp()), "aüa", "aua")
ERROR: StringIndexError("aüa", 3)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
 [2] Type at ./strings/substring.jl:32 [inlined]
 [3] Type at ./strings/substring.jl:38 [inlined]
 [4] SubString(::SubString{String}, ::Int64, ::Int64) at ./strings/substring.jl:44
 [5] matching_blocks!(::Set{Tuple{Int64,Int64,Int64}}, ::SubString{String}, ::SubString{String}, ::Int64, ::Int64) at /Users/alex/.julia/packages/StringDistances/PGDwX/src/distances/RatcliffObershelp.jl:35
 [6] matching_blocks!(::Set{Tuple{Int64,Int64,Int64}}, ::String, ::String, ::Int64, ::Int64) at /Users/alex/.julia/packages/StringDistances/PGDwX/src/distances/RatcliffObershelp.jl:41
 [7] matching_blocks at /Users/alex/.julia/packages/StringDistances/PGDwX/src/distances/RatcliffObershelp.jl:48 [inlined]
 [8] evaluate(::RatcliffObershelp, ::String, ::String) at /Users/alex/.julia/packages/StringDistances/PGDwX/src/distances/RatcliffObershelp.jl:56
 [9] compare at /Users/alex/.julia/packages/StringDistances/PGDwX/src/compare.jl:9 [inlined]
 [10] compare(::TokenMax{RatcliffObershelp}, ::String, ::String) at /Users/alex/.julia/packages/StringDistances/PGDwX/src/compare.jl:181
 [11] top-level scope at none:0

Tag a new version

Could you please register a patch version update with the new Distances.jl compatibility? That would be great, thanks.

Operate on types other than strings

Would it be possible to use this package to operate on vectors, e.g.,

evaluate(Levenshtein(), [1,2,3], [1,2,10]) == 1 # They differ in one element

?

DamerauLevenshtein() vs Levenshtein() why the same distance ?

I expected a shorter distance for DamerauLevenshtein because there is an exchange of two letters. Why are the results the same?

julia> compare("martha", "martht", DamerauLevenshtein())
0.8333333333333334

julia> compare("martha", "martht", Levenshtein())
0.8333333333333334

Overflow errors due to UInt8 in CountIteratorDictionary?

I see that you have switched to a dictionary-based count iterator. But how can it be safe to use UInt8s to count the number of occurrences of a q-gram? Since UInt8(255) + UInt8(1) == UInt8(0), this means that if one of the strings has more than 255 occurrences of one and the same q-gram, all of the QGram distances will give the wrong result.

Maybe it would be better to use the max number of q-grams in the strings (s1 and s2) to select how many bits of the UInt to use?
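The wrap-around is easy to demonstrate in plain Julia (`count_qgram` is a hypothetical counter loop, just for illustration):

```julia
# UInt8 arithmetic wraps around silently in Julia, so a q-gram seen
# 256 times would be counted as never seen:
UInt8(255) + UInt8(1) == UInt8(0)  # → true

# Simulate incrementing a UInt8 counter n times:
count_qgram(n) = foldl((c, _) -> c + UInt8(1), 1:n; init = UInt8(0))
count_qgram(255)  # → 0xff (255, still correct)
count_qgram(256)  # → 0x00 (wrapped to zero)
```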

Weight parameter

A very common use-case is to give different weights to mistakes depending on qwerty keyboard distance.
The typos 'confurmation' and 'confprmation' have the same distance to both 'confirmation' and 'conformation'.
But it's more likely that 'confurmation' meant 'confirmation', because 'u' is closer to 'i' on the keyboard.
And it's more likely that 'confprmation' meant 'conformation', because 'p' is closer to 'o' on the keyboard.
It's a silly example but you get the idea.
This is just one of many cases a weighting parameter could help with.
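For illustration, a weighted variant can be sketched by threading a substitution-cost function through the standard Levenshtein dynamic program (`wlev` and `kbcost` are hypothetical names, not part of the package):

```julia
# Levenshtein with a pluggable substitution cost. `subcost(a, b)` returns
# the penalty for replacing character `a` with character `b`.
function wlev(s1, s2, subcost)
    l2 = length(s2)
    prev = collect(0.0:l2)
    curr = similar(prev)
    for (i, ca) in enumerate(s1)
        curr[1] = i
        for (j, cb) in enumerate(s2)
            sub = ca == cb ? 0.0 : subcost(ca, cb)
            curr[j + 1] = min(prev[j + 1] + 1, curr[j] + 1, prev[j] + sub)
        end
        prev, curr = curr, prev
    end
    return prev[l2 + 1]
end

# Toy QWERTY adjacency: substituting a neighbouring key costs only 0.5.
neighbours = Dict('u' => "yi", 'i' => "uo", 'o' => "ip", 'p' => "o")
kbcost(a, b) = (b in get(neighbours, a, "") || a in get(neighbours, b, "")) ? 0.5 : 1.0

wlev("confurmation", "confirmation", kbcost)  # → 0.5  ('u' is next to 'i')
wlev("confurmation", "conformation", kbcost)  # → 1.0  ('u' is not next to 'o')
```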

Define `result_type` for string distances

I'd like to use result_type in a package that's supposed to work on arbitrary subtypes of PreMetric, but this isn't defined for any of the string distances here. It should be relatively straightforward to add, for reference: Distances.jl.

`Base.findmin(s1, s2, dist::Partial)`

Hi and thanks for the great package!

I found the following snippet (modified from (dist::Partial)(s1, s2)) useful and I was wondering if it would be worth adding to StringDistances? I could make a PR if so.

julia> function Base.findmin(s1, s2, dist::Partial; max_dist = 1.0)
    s1, s2 = StringDistances.reorder(s1, s2)
    len1, len2 = length(s1), length(s2)
    len1 == len2 && return dist.dist(s1, s2, max_dist), firstindex(s2):lastindex(s2)
    len1 == 0 && return max_dist+1, 1:0
    out = max_dist+1
    out_idx = 0
    for (i, x) in enumerate(qgrams(s2, len1))
        curr = dist.dist(s1, x, max_dist)
        out_idx = ifelse(curr < out, i, out_idx)
        out = min(out, curr)
        max_dist = min(out, max_dist)
    end
    return out, nextind(s2, 0, out_idx):nextind(s2, 0, out_idx+len1-1)
end

julia>  findmin("βabc", "βadcacαaXXcαγ", Partial(DamerauLevenshtein()))
(0.25, 1:5)

julia> "βadcacαaXXcαγ"[1:5]
"βadc"

Also, I am new to working with unicode strings, so it's possible I haven't used nextind correctly.

Ability to stop comparison when threshold is met?

Hi Matthieu,

Thanks for the awesome package! I was wondering if it would be possible to implement the ability to stop the comparison when a threshold is met? For example, if I'm only interested in filtering by strings where the result of compare is > 0.8, I'm wondering if it would be possible to know earlier whether or not two strings are going to exceed the threshold, and then bail out on the comparison?

See this thread on discourse for background.

Basically, I'm doing comparisons on arrays that contain hundreds of thousands of strings, so I'm looking for ways to optimize the comparison so that it doesn't take so long.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

pairwise not working with StringDistances

In the following example, the pairwise function from Distances.jl does not work.

Versions:
* Julia: 1.5.1
* Distances: 0.9.2
* StringDistances: 0.8.0

julia> using Distances, StringDistances

julia> pairwise(Jaro(), ["hello" "world"])
ERROR: MethodError: result_type(::Jaro, ::Array{String,2}, ::Array{String,2}) is ambiguous. Candidates:
  result_type(dist::PreMetric, a::AbstractArray, b::AbstractArray) in Distances at /home/simon/.julia/packages/Distances/1CdE2/src/generic.jl:36
  result_type(dist::Union{DamerauLevenshtein, Jaro, Levenshtein, RatcliffObershelp, StringDistances.QGramDistance, StringDistances.Normalize, Partial, TokenMax, TokenSet, TokenSort, Winkler}, s1, s2) in StringDistances at /home/simon/.julia/packages/StringDistances/67cTd/src/StringDistances.jl:12
Possible fix, define
  result_type(::Union{DamerauLevenshtein, Jaro, Levenshtein, RatcliffObershelp, StringDistances.QGramDistance, StringDistances.Normalize, Partial, TokenMax, TokenSet, TokenSort, Winkler}, ::AbstractArray, ::AbstractArray)
Stacktrace:
 [1] pairwise(::Jaro, ::Array{String,2}; dims::Nothing) at /home/simon/.julia/packages/Distances/1CdE2/src/generic.jl:212
 [2] pairwise(::Jaro, ::Array{String,2}) at /home/simon/.julia/packages/Distances/1CdE2/src/generic.jl:209
 [3] top-level scope at REPL[7]:1

I suspect that the reason for this ambiguity is making StringDistance a union, see #19

incremental compilation may be fatally broken for this module

Running on Julia 1.5.4
StringDistances 0.11.0

When I try to precompile my project or build a SysImage that includes StringDistances, I get this warning:

WARNING: Method definition (::Type{StringDistances.Normalized{T} where T<:Union{StringDistances.StringMetric, StringDistances.StringSemiMetric}})(Union{StringDistances.StringMetric, StringDistances.StringSemiMetric}) in module StringDistances at /home/ubuntu/.julia/packages/StringDistances/LNzpw/src/normalize.jl:19 overwritten at /home/ubuntu/.julia/packages/StringDistances/LNzpw/src/normalize.jl:21.
  ** incremental compilation may be fatally broken for this module **

The lines in question are:

18. struct Normalized{T <: Union{StringSemiMetric, StringMetric}} <: StringSemiMetric
19.     dist::T
20. end
21. Normalized(dist::Union{StringSemiMetric, StringMetric}) = Normalized{typeof(dist)}(dist)

I believe in Julia 1.5, the constructor on line 21 is identical to the default constructor.
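For reference, a minimal reproduction of the pattern shows that Julia already generates this outer constructor for parametric structs (`Norm`, `Toy`, and `Dist` are hypothetical names standing in for the package's types):

```julia
abstract type Dist end
struct Toy <: Dist end

struct Norm{T <: Dist}
    dist::T
end
# Julia auto-generates Norm(dist::T) where {T <: Dist} = Norm{T}(dist),
# so an explicit `Norm(dist::Dist) = Norm{typeof(dist)}(dist)` overwrites it
# and triggers the method-overwritten precompilation warning.
Norm(Toy()) isa Norm{Toy}  # → true
```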
