matthieugomez / StringDistances.jl
String Distances in Julia
License: Other
Hi and thanks for the great package!
I found the following snippet (modified from (dist::Partial)(s1, s2)) useful and I was wondering whether it would be worth adding to StringDistances? I could make a PR if so.
julia> function Base.findmin(s1, s2, dist::Partial; max_dist = 1.0)
           s1, s2 = StringDistances.reorder(s1, s2)   # ensure length(s1) <= length(s2)
           len1, len2 = length(s1), length(s2)
           len1 == len2 && return dist.dist(s1, s2, max_dist), firstindex(s2):lastindex(s2)
           len1 == 0 && return max_dist + 1, 1:0
           out = max_dist + 1
           out_idx = 0
           for (i, x) in enumerate(qgrams(s2, len1))  # slide a len1-wide window over s2
               curr = dist.dist(s1, x, max_dist)
               out_idx = ifelse(curr < out, i, out_idx)
               out = min(out, curr)
               max_dist = min(out, max_dist)          # shrink the search bound as matches improve
           end
           return out, nextind(s2, 0, out_idx):nextind(s2, 0, out_idx + len1 - 1)
       end
julia> findmin("βabc", "βadcacαaXXcαγ", Partial(DamerauLevenshtein()))
(0.25, 1:5)
julia> "βadcacαaXXcαγ"[1:5]
"βadc"
Also, I am new to working with unicode strings, so it's possible I haven't used nextind
correctly.
It satisfies the triangle inequality.
Here's a nice blog post on that fact.
But in the package we have everything subtyping StringDistance, which subtypes SemiMetric.
Getting this wrong is a problem because various kinds of nearest-neighbour trees (NearestNeighbors.jl) require a true Metric to work.
I suggest that we @deprecate_binding StringDistance SemiMetric and just use the right type from Distances.jl for each thing directly.
The fact that a particular metric is a StringDistance does not particularly matter.
Matthieu, I checked and you are right that there is little to no performance benefit in splitting the counting for QGramDistances into countleft, countright, and countboth. I have therefore made a branch which simplifies the implementation by having only a single count method.
I also split the helper code for QGramDistances into separate files in a helpers subdir and generalized it to prepare for other dictionaries (when preprocessing), as well as for iterators that yield variable-length words (q-gram iterators are fixed-length).
This new design will be more maintainable going forward and keeps the good performance. For some distances performance even improves since we need not re-calculate information we already have after preprocessing.
Even though there are no new features added there are quite a few changes so before I potentially do a PR feel free to compare to your current master here:
Thanks for StringDistances.jl. For my research, I have developed a similar package and only just found yours. There is one difference in our designs in that I "cache" the Qgrams in a Counter dict so that I can later, more quickly, compare the similarity of new strings with sets of old strings. This is useful when searching/optimising for sets of similar or dissimilar strings/objects. Would you be interested in exploring a way to support this type of design also in your package?
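The caching idea described above can be sketched roughly as follows. This is a hypothetical illustration, not the package's API: the names qgram_counts and cached_jaccard are invented for this sketch.

```julia
# Hypothetical sketch: count each string's q-grams once into a Dict,
# then compare cached counts instead of re-iterating the strings.
function qgram_counts(s::AbstractString, q::Int)
    counts = Dict{String,Int}()
    chars = collect(s)
    for i in 1:(length(chars) - q + 1)
        g = String(chars[i:i+q-1])
        counts[g] = get(counts, g, 0) + 1
    end
    return counts
end

# Jaccard distance on two precomputed q-gram count dicts:
# 1 - |intersection of q-gram sets| / |union of q-gram sets|.
function cached_jaccard(c1::Dict{String,Int}, c2::Dict{String,Int})
    ks1, ks2 = keys(c1), keys(c2)
    uni = length(union(ks1, ks2))
    return uni == 0 ? 0.0 : 1.0 - length(intersect(ks1, ks2)) / uni
end
```

With this design, counts for a fixed reference set are computed once up front; each new query string then needs only a single qgram_counts call before being compared against the whole set.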
julia> compare("ab", "de", Partial(DamerauLevenshtein()))
-1
julia> compare("abc", "de", Partial(DamerauLevenshtein()))
-1
This causes findmax to not work. I think this is due to the normalization changes. This also seems odd:
julia> normalize(Partial(DamerauLevenshtein()))("ab", "cde")
2
julia> normalize(Partial(DamerauLevenshtein()))("ab", "cdef")
2
I understand that to normalize the Levenshtein distance, the value is divided by the longer string length.
However, the results I have calculated using that method are different from the results using the compare function.
julia> using StringDistances
julia> s1 = "martha"
julia> s2 = "marhtai"
julia> Levenshtein()(s1, s2)
3
julia> max(length(s1),length(s2))
7
julia> 1 - Levenshtein()(s1, s2) / 7
0.5714285714285714
julia> compare(s1, s2, Levenshtein())
0.8571428571428572
Why does this result occur?
I would appreciate it if you could explain.
I get a type error when combining TokenMax with qgram measures in some cases:
julia> StringDistances.compare(TokenMax(Jaccard(2)), "aa", "aa ")
ERROR: TypeError: collect: in typeassert, expected SubString{SubString{String}}, got SubString{String}
Stacktrace:
[1] collect(::StringDistances.QGramIterator{SubString{String},Int64}) at /Users/alanschelten/.julia/v0.6/StringDistances/src/distances/qgram.jl:33
[2] sort at /Users/alanschelten/.julia/v0.6/StringDistances/src/distances/qgram.jl:37 [inlined]
[3] evaluate(::StringDistances.Jaccard{Int64}, ::String, ::SubString{String}) at /Users/alanschelten/.julia/v0.6/StringDistances/src/distances/qgram.jl:85
[4] compare(::StringDistances.Jaccard{Int64}, ::String, ::SubString{String}) at /Users/alanschelten/.julia/v0.6/StringDistances/src/compare.jl:23
[5] compare(::StringDistances.Partial{StringDistances.Jaccard{Int64}}, ::String, ::String) at /Users/alanschelten/.julia/v0.6/StringDistances/src/modifiers/fuzzywuzzy.jl:19
[6] compare(::StringDistances.TokenMax{StringDistances.Jaccard{Int64}}, ::String, ::String) at /Users/alanschelten/.julia/v0.6/StringDistances/src/modifiers/fuzzywuzzy.jl:138
I see these functions ask for AbstractStrings. I'm curious whether that is a necessary restriction, or could they also apply to AbstractVectors and other types?
I expected a shorter distance for Damerau-Levenshtein because there is an exchange of two letters. Why are the results the same?
julia> compare("martha", "martht", DamerauLevenshtein())
0.8333333333333334
julia> compare("martha", "martht", Levenshtein())
0.8333333333333334
Some QGram distances today return NaN if one of the input strings is shorter than the q-gram length (q), while others don't:
julia> using StringDistances
julia> isnan(Cosine(2)("", "bb"))
true
julia> isnan(Cosine(2)("a", "bb"))
true
julia> Jaccard(2)("a", "bb")
1.0
julia> filter(d -> isnan(d(2)("", "bb")), [QGram, Cosine, Jaccard, Overlap, SorensenDice, MorisitaOverlap, NMD])
3-element Vector{DataType}:
Cosine
Overlap
MorisitaOverlap
Maybe it is better to have consistent behaviour for such inputs? Throwing an ArgumentError might be better, and then the caller has to decide how to handle such situations.
Could you please register a patch version update with the new Distances.jl compatibility? That would be great, thanks.
I'm running pairwise() on two large vectors and wondered if there is a simple way to distribute this computation across multiple cores?
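One possible approach, sketched here as a hypothetical helper (threaded_pairwise is not part of StringDistances.jl or Distances.jl), is to fill the distance matrix with Julia's built-in multi-threading. This assumes Julia was started with several threads, e.g. `julia -t 4`.

```julia
using Base.Threads

# Hypothetical sketch: rows of the distance matrix are independent,
# so they can be computed in parallel with Threads.@threads.
# `dist` can be any callable distance, e.g. Levenshtein() from StringDistances.jl.
function threaded_pairwise(dist, xs::AbstractVector, ys::AbstractVector)
    D = Matrix{Float64}(undef, length(xs), length(ys))
    @threads for i in eachindex(xs)
        for j in eachindex(ys)
            D[i, j] = dist(xs[i], ys[j])
        end
    end
    return D
end
```

For truly multi-core (multi-process) distribution one would reach for the Distributed standard library instead, but the threaded version is usually the simplest first step.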
When processing distances between pairs of a large number of strings I sometimes run into problems, since evaluate(Jaccard(2), "a", "b") returns NaN. Is this the "right" return value, or should there be a "safe" version of evaluate that returns the max distance (1.0) if the strings are not equal and the min distance (0.0) if they are the same?
I'm looking for a function for partial hamming distance. It looks like it existed in an earlier version but was taken out.
Is it possible to put it back in?
A very common use-case is to give different weights to mistakes depending on qwerty keyboard distance.
The typo 'confurmation' and the typo 'confprmation' have the same distance to both 'confirmation' and 'conformation'.
But it's more likely that 'confurmation' meant 'confirmation', because 'u' is closer to 'i' on the keyboard.
And it's more likely that 'confprmation' meant 'conformation', because 'p' is closer to 'o' on the keyboard.
It's a silly example but you get the idea.
This is just one of many cases a weighting parameter could help with.
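A keyboard-aware substitution cost, which such a weighting parameter could plug into an edit distance, might be sketched like this. Everything here (the layout table, the scaling, the name sub_cost) is illustrative, not part of StringDistances.jl:

```julia
# Hypothetical sketch: give each letter a (row, column) position on a
# QWERTY keyboard and make substitution cost grow with the Euclidean
# distance between the two keys.
const QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
const KEY_POS = Dict{Char,Tuple{Int,Int}}(
    c => (r, i) for (r, row) in enumerate(QWERTY_ROWS) for (i, c) in enumerate(row)
)

# Cost in (0, 1]: adjacent keys are cheap, distant keys approach 1.
function sub_cost(a::Char, b::Char)
    a == b && return 0.0
    (haskey(KEY_POS, a) && haskey(KEY_POS, b)) || return 1.0
    (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
    d = sqrt((r1 - r2)^2 + (c1 - c2)^2)
    return min(1.0, d / 5)  # illustrative scaling factor
end
```

Under this cost, the 'u' to 'i' substitution (adjacent keys) is cheaper than 'u' to 'p', matching the intuition in the example above.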
line 18 ends abruptly:
q in each string (and which
Thanks again for a great package.
I repeatedly need to calculate distances to some set of strings, so I tried to speed things up by pre-counting q-grams. This can be very useful when calculating, for example, distance matrices.
I can get about 2.5-3x speedups after pre-counting. Of course, it is slower (double the time on my machine) if you only want to calculate distances once.
If there is any interest in merging this I can try doing a PR at some point; if not, here is the code in case someone else has a similar need:
https://gist.github.com/robertfeldt/103f078b3154c5621f52cee3d061bf81
I'll try to also use a pre-sorted array of counts rather than a dict and see if I can push this further.
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml
to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
I encountered an error when trying to compare a bunch of strings in Chinese. This
compare(TokenMax(RatcliffObershelp()), "为人子女者要堂堂正正做人,千万不可作奸犯科,致使父母蒙羞", "此前稍早些时候**商务部发布消息称,中美经贸高级别磋商双方牵头人通话,中方就美拟9月1日加征关税进行了严正交涉。")
or this
compare(Partial(RatcliffObershelp()), "为人子女者要堂堂正正做人,千万不可作奸犯科,致使父母蒙羞", "此前稍早些时候**商务部发布消息称,中美经贸高级别磋商双方牵头人通话,中方就美拟9月1日加征关税进行了严正交涉。")
throw a StringIndexError. At the same time, TokenSort, TokenSet, or just plain RatcliffObershelp work just fine. Some relic from the times when StringDistances worked only with ASCII strings, perhaps?
using StringDistances
a = "qxKH"
b = collect(a); b[2]=rand('a':'z'); b = join(b);
DamerauLevenshtein()(a,b)
#> 1
DamerauLevenshtein()(a,b, 1.1)
#>2.1
DamerauLevenshtein()(a,b, 2)
#>1
Here, DL(a, b, max_dist) should be 1 for any max_dist > 1. But it seems for 1 < max_dist < 2 we get the wrong answer.
Running on Julia 1.5.4
StringDistances 0.11.0
When I try to precompile my project or build a SysImage that includes StringDistances, I get this warning:
WARNING: Method definition (::Type{StringDistances.Normalized{T} where T<:Union{StringDistances.StringMetric, StringDistances.StringSemiMetric}})(Union{StringDistances.StringMetric, StringDistances.StringSemiMetric}) in module StringDistances at /home/ubuntu/.julia/packages/StringDistances/LNzpw/src/normalize.jl:19 overwritten at /home/ubuntu/.julia/packages/StringDistances/LNzpw/src/normalize.jl:21.
** incremental compilation may be fatally broken for this module **
The lines in question are:
18. struct Normalized{T <: Union{StringSemiMetric, StringMetric}} <: StringSemiMetric
19. dist::T
20. end
21. Normalized(dist::Union{StringSemiMetric, StringMetric}) = Normalized{typeof(dist)}(dist)
I believe in Julia 1.5, the constructor on line 21 is identical to the default constructor.
...but sometimes shorter substrings can give better matches:
julia> using StringDistances #v0.10
julia> a = "giant golden crowned flying fox"
"giant golden crowned flying fox"
julia> b = "A giantgolden crownedflying fox slept all day."
"A giantgolden crownedflying fox slept all day."
julia> Partial(DamerauLevenshtein())(a,b)
4
julia> DamerauLevenshtein()(a,b[3:31])
2
julia> b[3:31]
"giantgolden crownedflying fox"
Not really sure whether this counts as a bug, but I think it should at least be documented. If there were a performant way to match shorter subsequences too, I think that would be ideal.
The tag name "0.3.1" is not of the appropriate SemVer form (vX.Y.Z).
cc: @matthieugomez
Hi, would you mind tagging a new release -- even though not that much has happened since 0.3.0 it'd be nice to have the couple of bugfixes :)
I'd like to use result_type in a package that's supposed to work on arbitrary subtypes of PreMetric, but it isn't defined for any of the string distances here. It should be relatively straightforward to add; for reference: Distances.jl.
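The idea behind result_type can be illustrated with a self-contained sketch. The names here (MyPreMetric, MyEdit, my_result_type, my_pairwise) are invented for illustration and are not the actual Distances.jl or StringDistances.jl API; the concrete return types are guesses of what would be natural.

```julia
# Hypothetical sketch: each distance declares the element type of its
# results so that generic code can preallocate output arrays.
abstract type MyPreMetric end

struct MyEdit <: MyPreMetric end             # counts edit operations -> Int
struct MyNormalized <: MyPreMetric end       # values in [0, 1] -> Float64
(::MyEdit)(a, b) = abs(length(a) - length(b))  # stand-in "distance"

my_result_type(::MyEdit, a, b) = Int
my_result_type(::MyNormalized, a, b) = Float64

# Generic code (like pairwise) can now preallocate a correctly typed matrix:
function my_pairwise(dist::MyPreMetric, xs, ys)
    D = Matrix{my_result_type(dist, xs, ys)}(undef, length(xs), length(ys))
    for i in eachindex(xs), j in eachindex(ys)
        D[i, j] = dist(xs[i], ys[j])
    end
    return D
end
```

In Distances.jl the corresponding trait is Distances.result_type(dist, a, b), which is what generic PreMetric code dispatches on.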
In lines 150-151 of fuzzywuzzy.jl, tsor and tser should be ptsor and ptser, respectively.
Hi Matthieu,
Thanks for the awesome package! I was wondering if it would be possible to stop the comparison when a threshold is met? For example, if I'm only interested in filtering by strings where the result of compare is > 0.8, I'm wondering if it would be possible to know earlier whether or not two strings are going to exceed the threshold, and then bail out of the comparison?
See this thread on discourse for background.
Basically, I'm doing comparisons on arrays that contain hundreds of thousands of strings, so I'm looking for ways to optimize the comparison so that it doesn't take so long.
On Julia v1.7, Jaro() doesn't work in the Mac M1 build.
Computing distances with one of the inputs as an array (of anything) does not result in an error, but instead gives a nonsensical result.
For example,
using StringDistances
d = Levenshtein()
d("minimal working example", ["an array", " of strings"])
returns 23, which is the length of the first argument, not the distance between the first argument and any of the entries in the second argument.
The result is identical (the length of the first argument, at least for Levenshtein) if the array is empty or contains non-string entries.
The indexing into Unicode strings is faulty:
julia> compare(TokenMax(RatcliffObershelp()), "aüa", "aua")
ERROR: StringIndexError("aüa", 3)
Stacktrace:
[1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
[2] Type at ./strings/substring.jl:32 [inlined]
[3] Type at ./strings/substring.jl:38 [inlined]
[4] SubString(::SubString{String}, ::Int64, ::Int64) at ./strings/substring.jl:44
[5] matching_blocks!(::Set{Tuple{Int64,Int64,Int64}}, ::SubString{String}, ::SubString{String}, ::Int64, ::Int64) at /Users/alex/.julia/packages/StringDistances/PGDwX/src/distances/RatcliffObershelp.jl:35
[6] matching_blocks!(::Set{Tuple{Int64,Int64,Int64}}, ::String, ::String, ::Int64, ::Int64) at /Users/alex/.julia/packages/StringDistances/PGDwX/src/distances/RatcliffObershelp.jl:41
[7] matching_blocks at /Users/alex/.julia/packages/StringDistances/PGDwX/src/distances/RatcliffObershelp.jl:48 [inlined]
[8] evaluate(::RatcliffObershelp, ::String, ::String) at /Users/alex/.julia/packages/StringDistances/PGDwX/src/distances/RatcliffObershelp.jl:56
[9] compare at /Users/alex/.julia/packages/StringDistances/PGDwX/src/compare.jl:9 [inlined]
[10] compare(::TokenMax{RatcliffObershelp}, ::String, ::String) at /Users/alex/.julia/packages/StringDistances/PGDwX/src/compare.jl:181
[11] top-level scope at none:0
When I was working with the DamerauLevenshtein evaluate function, I noticed that it allocates a lot. The problem is with these lines: https://github.com/matthieugomez/StringDistances.jl/blob/master/src/edit.jl#L146-L147
It doesn't look like much when you compare only two words, but in my particular case I needed to compare thousands of words in a single run, and these allocations became a real problem. I rewrote the algorithm to reuse two mutable vectors between calculations and saw an almost 80% drop in allocations and a 10%-20% increase in speed. If it is of interest, I can make a PR adding mutating versions of this and the Levenshtein evaluate functions.
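The buffer-reuse idea can be sketched as follows. This is a hypothetical, simplified Levenshtein with preallocated row buffers, not the package's actual implementation; BufferedLevenshtein is an invented name.

```julia
# Hypothetical sketch: keep two work vectors alive across calls so the
# dynamic-programming rows are not reallocated on every distance computation.
struct BufferedLevenshtein
    prev::Vector{Int}
    curr::Vector{Int}
end
BufferedLevenshtein(maxlen::Int) =
    BufferedLevenshtein(zeros(Int, maxlen + 1), zeros(Int, maxlen + 1))

function (d::BufferedLevenshtein)(s1, s2)
    a, b = collect(s1), collect(s2)   # work on character vectors
    m, n = length(a), length(b)
    @assert n + 1 <= length(d.prev) "buffer too small for this string"
    prev, curr = d.prev, d.curr
    for j in 0:n                      # row 0: distance from the empty prefix
        prev[j + 1] = j
    end
    for i in 1:m
        curr[1] = i
        for j in 1:n
            cost = a[i] == b[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,  # deletion
                              curr[j] + 1,      # insertion
                              prev[j] + cost)   # substitution / match
        end
        prev, curr = curr, prev       # swap local bindings, no allocation
    end
    return prev[n + 1]
end
```

Each call overwrites the same two vectors, so repeated comparisons against thousands of words allocate only the temporary character vectors.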
@JuliaRegistrator register()
In real-world cases it's common to find string mismatches due to words that sound the same but are written differently, like meet-meat, one-won, for-four, ate-eight, and many others.
In the following example, the pairwise function from Distances.jl does not work.
Versions:
* Julia: 1.5.1
* Distances: 0.9.2
* StringDistances: 0.8.0
julia> using Distances, StringDistances
julia> pairwise(Jaro(), ["hello" "world"])
ERROR: MethodError: result_type(::Jaro, ::Array{String,2}, ::Array{String,2}) is ambiguous. Candidates:
result_type(dist::PreMetric, a::AbstractArray, b::AbstractArray) in Distances at /home/simon/.julia/packages/Distances/1CdE2/src/generic.jl:36
result_type(dist::Union{DamerauLevenshtein, Jaro, Levenshtein, RatcliffObershelp, StringDistances.QGramDistance, StringDistances.Normalize, Partial, TokenMax, TokenSet, TokenSort, Winkler}, s1, s2) in StringDistances at /home/simon/.julia/packages/StringDistances/67cTd/src/StringDistances.jl:12
Possible fix, define
result_type(::Union{DamerauLevenshtein, Jaro, Levenshtein, RatcliffObershelp, StringDistances.QGramDistance, StringDistances.Normalize, Partial, TokenMax, TokenSet, TokenSort, Winkler}, ::AbstractArray, ::AbstractArray)
Stacktrace:
[1] pairwise(::Jaro, ::Array{String,2}; dims::Nothing) at /home/simon/.julia/packages/Distances/1CdE2/src/generic.jl:212
[2] pairwise(::Jaro, ::Array{String,2}) at /home/simon/.julia/packages/Distances/1CdE2/src/generic.jl:209
[3] top-level scope at REPL[7]:1
I suspect that the reason for this ambiguity is making StringDistance a union; see #19
I was wondering whether this library uses the bit parallel speed up tricks from:
https://www.win.tue.nl/~jfg/educ/bit.mat.pdf
https://users.dcc.uchile.cl/~gnavarro/ps/jea06.pdf
They seem very worthwhile.
Would it be possible to use this package to operate on vectors, e.g.,
evaluate(Levenshtein(), [1,2,3], [1,2,10]) == 1 # They differ in one element
?
I see that you have switched to a dictionary-based count iterator. But how can it be safe to use UInt8s to count the number of occurrences of a q-gram? Since UInt8(255) + UInt8(1) == UInt8(0), this would mean that if one of the strings has more than 255 occurrences of one and the same q-gram, all of the QGram distances will give the wrong result.
Maybe it would be better to use the max number of q-grams in the strings (s1 and s2) to select how many bits of the UInt to use?
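The suggested selection could be sketched like this (a hypothetical helper, not package code): since a string of length n contains at most n - q + 1 occurrences of any single q-gram, that bound can pick the narrowest counter type that cannot overflow.

```julia
# Hypothetical sketch: choose the smallest unsigned integer type that can
# hold the maximum possible count of any single q-gram in either string.
function counter_type(s1, s2, q::Int)
    n = max(length(s1), length(s2)) - q + 1  # upper bound on any q-gram count
    n <= Int(typemax(UInt8))  && return UInt8
    n <= Int(typemax(UInt16)) && return UInt16
    n <= Int(typemax(UInt32)) && return UInt32
    return UInt64
end
```

UInt8 then remains the common fast path for short strings, while very long strings transparently get a wider counter.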