matthieugomez / StringDistances.jl
String Distances in Julia
License: Other
Hi and thanks for the great package!
I found the following snippet (modified from (dist::Partial)(s1, s2)) useful and I was wondering whether it would be worth adding to StringDistances? I could make a PR if so.
julia> function Base.findmin(s1, s2, dist::Partial; max_dist = 1.0)
           s1, s2 = StringDistances.reorder(s1, s2)   # ensure length(s1) <= length(s2)
           len1, len2 = length(s1), length(s2)
           len1 == len2 && return dist.dist(s1, s2, max_dist), firstindex(s2):lastindex(s2)
           len1 == 0 && return max_dist + 1, 1:0
           out = max_dist + 1
           out_idx = 0
           for (i, x) in enumerate(qgrams(s2, len1))  # slide a len1-wide window over s2
               curr = dist.dist(s1, x, max_dist)
               out_idx = ifelse(curr < out, i, out_idx)
               out = min(out, curr)
               max_dist = min(out, max_dist)          # shrink the search bound as matches improve
           end
           return out, nextind(s2, 0, out_idx):nextind(s2, 0, out_idx + len1 - 1)
       end
julia> findmin("βabc", "βadcacαaXXcαγ", Partial(DamerauLevenshtein()))
(0.25, 1:5)
julia> "βadcacαaXXcαγ"[1:5]
"βadc"
Also, I am new to working with unicode strings, so it's possible I haven't used nextind
correctly.
It satisfies the triangle inequality.
Here's a nice blog post on that fact.
But in the package we have everything subtyping StringDistance, which subtypes SemiMetric.
Getting this wrong is a problem because various kinds of nearest-neighbour trees (NearestNeighbors.jl) require a true Metric to work.
I suggest that we @deprecate_binding StringDistance SemiMetric and just use the right type from Distances.jl for each thing directly.
The fact that a particular metric is a StringDistance does not particularly matter.
Matthieu, I checked and you are right that there is little to no performance benefit in splitting the counting for QGramDistances into countleft, countright, and countboth. I have therefore made a branch which simplifies the implementation by having only a single count method.
I also split the helper code for QGramDistances into separate files in a helpers subdir and generalized it to prepare for other dictionaries (when preprocessing), as well as for iterators that yield variable-length words (q-gram iterators are fixed-length).
This new design will be more maintainable going forward and keeps the good performance. For some distances performance even improves since we need not re-calculate information we already have after preprocessing.
Even though there are no new features added there are quite a few changes so before I potentially do a PR feel free to compare to your current master here:
Thanks for StringDistances.jl. For my research, I have developed a similar package and only just found yours. There is one difference in our designs in that I "cache" the Qgrams in a Counter dict so that I can later, more quickly, compare the similarity of new strings with sets of old strings. This is useful when searching/optimising for sets of similar or dissimilar strings/objects. Would you be interested in exploring a way to support this type of design also in your package?
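The caching idea described above can be sketched roughly as follows. This is a hypothetical illustration, not the package's API: the names qgram_counts and cached_jaccard are invented for this sketch.

```julia
# Hypothetical sketch: count each string's q-grams once into a Dict,
# then compare cached counts instead of re-iterating the strings.
function qgram_counts(s::AbstractString, q::Int)
    counts = Dict{String,Int}()
    chars = collect(s)
    for i in 1:(length(chars) - q + 1)
        g = String(chars[i:i+q-1])
        counts[g] = get(counts, g, 0) + 1
    end
    return counts
end

# Jaccard distance on two precomputed q-gram count dicts:
# 1 - |intersection of q-gram sets| / |union of q-gram sets|.
function cached_jaccard(c1::Dict{String,Int}, c2::Dict{String,Int})
    ks1, ks2 = keys(c1), keys(c2)
    uni = length(union(ks1, ks2))
    return uni == 0 ? 0.0 : 1.0 - length(intersect(ks1, ks2)) / uni
end
```

With this design, counts for a fixed reference set are computed once up front; each new query string then needs only a single qgram_counts call before being compared against the whole set.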
julia> compare("ab", "de", Partial(DamerauLevenshtein()))
-1
julia> compare("abc", "de", Partial(DamerauLevenshtein()))
-1
This causes findmax to not work. I think this is due to the normalization changes. This also seems odd:
julia> normalize(Partial(DamerauLevenshtein()))("ab", "cde")
2
julia> normalize(Partial(DamerauLevenshtein()))("ab", "cdef")
2
I understand that to normalize the Levenshtein distance, the value is divided by the longer string length.
However, the results I have calculated using that method are different from the results using the compare function.
julia> using StringDistances
julia> s1 = "martha"
julia> s2 = "marhtai"
julia> Levenshtein()(s1, s2)
3
julia> max(length(s1),length(s2))
7
julia> 1 - Levenshtein()(s1, s2) / 7
0.5714285714285714
julia> compare(s1, s2, Levenshtein())
0.8571428571428572
Why does this result occur?
I would appreciate it if you could explain.
I get a type error when combining TokenMax with qgram measures in some cases:
julia> StringDistances.compare(TokenMax(Jaccard(2)), "aa", "aa ")
ERROR: TypeError: collect: in typeassert, expected SubString{SubString{String}}, got SubString{String}
Stacktrace:
[1] collect(::StringDistances.QGramIterator{SubString{String},Int64}) at /Users/alanschelten/.julia/v0.6/StringDistances/src/distances/qgram.jl:33
[2] sort at /Users/alanschelten/.julia/v0.6/StringDistances/src/distances/qgram.jl:37 [inlined]
[3] evaluate(::StringDistances.Jaccard{Int64}, ::String, ::SubString{String}) at /Users/alanschelten/.julia/v0.6/StringDistances/src/distances/qgram.jl:85
[4] compare(::StringDistances.Jaccard{Int64}, ::String, ::SubString{String}) at /Users/alanschelten/.julia/v0.6/StringDistances/src/compare.jl:23
[5] compare(::StringDistances.Partial{StringDistances.Jaccard{Int64}}, ::String, ::String) at /Users/alanschelten/.julia/v0.6/StringDistances/src/modifiers/fuzzywuzzy.jl:19
[6] compare(::StringDistances.TokenMax{StringDistances.Jaccard{Int64}}, ::String, ::String) at /Users/alanschelten/.julia/v0.6/StringDistances/src/modifiers/fuzzywuzzy.jl:138
I see these functions ask for AbstractStrings. I'm curious whether that is a necessary restriction, or could they also apply to AbstractVectors and other types?
I expected a shorter distance for Damerau-Levenshtein because there is an exchange of two letters. Why are the results the same?
julia> compare("martha", "martht", DamerauLevenshtein())
0.8333333333333334
julia> compare("martha", "martht", Levenshtein())
0.8333333333333334
Some QGram distances today return NaN if one of the input strings is shorter than the q-gram length (q), while others don't:
julia> using StringDistances
julia> isnan(Cosine(2)("", "bb"))
true
julia> isnan(Cosine(2)("a", "bb"))
true
julia> Jaccard(2)("a", "bb")
1.0
julia> filter(d -> isnan(d(2)("", "bb")), [QGram, Cosine, Jaccard, Overlap, SorensenDice, MorisitaOverlap, NMD])
3-element Vector{DataType}:
Cosine
Overlap
MorisitaOverlap
Maybe it is better to have consistent behaviour for such inputs? Throwing an ArgumentError might be better, and then the caller has to decide how to handle such situations.
Could you please register a patch version update with the new Distances.jl compatibility? That would be great, thanks.
I'm running pairwise() on two large vectors and wondered if there is a simple way to distribute this computation across multiple cores?
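One possible approach, sketched here as a hypothetical helper (threaded_pairwise is not part of StringDistances.jl or Distances.jl), is to fill the distance matrix with Julia's built-in multi-threading. This assumes Julia was started with several threads, e.g. `julia -t 4`.

```julia
using Base.Threads

# Hypothetical sketch: rows of the distance matrix are independent,
# so they can be computed in parallel with Threads.@threads.
# `dist` can be any callable distance, e.g. Levenshtein() from StringDistances.jl.
function threaded_pairwise(dist, xs::AbstractVector, ys::AbstractVector)
    D = Matrix{Float64}(undef, length(xs), length(ys))
    @threads for i in eachindex(xs)
        for j in eachindex(ys)
            D[i, j] = dist(xs[i], ys[j])
        end
    end
    return D
end
```

For truly multi-core (multi-process) distribution one would reach for the Distributed standard library instead, but the threaded version is usually the simplest first step.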
When processing distances between pairs of a large number of strings I sometimes run into problems, since evaluate(Jaccard(2), "a", "b") returns NaN. Is this the "right" return value, or should there be a "safe" version of evaluate that returns the max distance (1.0) if the strings are not equal and the min distance (0.0) if they are the same?
I'm looking for a function for partial hamming distance. It looks like it existed in an earlier version but was taken out.
Is it possible to put it back in?
A very common use-case is to give different weights to mistakes depending on qwerty keyboard distance.
The typo 'confurmation' and the typo 'confprmation' have the same distance to both 'confirmation' and 'conformation'.
But it's more likely that 'confurmation' meant 'confirmation', because 'u' is closer to 'i' on the keyboard.
And it's more likely that 'confprmation' meant 'conformation', because 'p' is closer to 'o' on the keyboard.
It's a silly example but you get the idea.
This is just one of many cases a weighting parameter could help with.
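A keyboard-aware substitution cost, which such a weighting parameter could plug into an edit distance, might be sketched like this. Everything here (the layout table, the scaling, the name sub_cost) is illustrative, not part of StringDistances.jl:

```julia
# Hypothetical sketch: give each letter a (row, column) position on a
# QWERTY keyboard and make substitution cost grow with the Euclidean
# distance between the two keys.
const QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
const KEY_POS = Dict{Char,Tuple{Int,Int}}(
    c => (r, i) for (r, row) in enumerate(QWERTY_ROWS) for (i, c) in enumerate(row)
)

# Cost in (0, 1]: adjacent keys are cheap, distant keys approach 1.
function sub_cost(a::Char, b::Char)
    a == b && return 0.0
    (haskey(KEY_POS, a) && haskey(KEY_POS, b)) || return 1.0
    (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
    d = sqrt((r1 - r2)^2 + (c1 - c2)^2)
    return min(1.0, d / 5)  # illustrative scaling factor
end
```

Under this cost, the 'u' to 'i' substitution (adjacent keys) is cheaper than 'u' to 'p', matching the intuition in the example above.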
line 18 ends abruptly:
q in each string (and which
Thanks again for a great package.
I repeatedly need to calculate distances to some set of strings, so I tried to speed things up by pre-counting q-grams. This can be very useful when calculating, for example, distance matrices.
I can get about 2.5-3x speedups after pre-counting. Of course, it is slower (double the time on my machine) if you only want to calculate distances once.
If there is any interest in merging this I can try doing a PR at some point; if not, here is the code in case someone else has a similar need:
https://gist.github.com/robertfeldt/103f078b3154c5621f52cee3d061bf81
I'll try to also use a pre-sorted array of counts rather than a dict and see if I can push this further.
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml
to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
I encountered an error when trying to compare a bunch of strings in Chinese. This
compare(TokenMax(RatcliffObershelp()), "为人子女者要堂堂正正做人,千万不可作奸犯科,致使父母蒙羞", "此前稍早些时候**商务部发布消息称,中美经贸高级别磋商双方牵头人通话,中方就美拟9月1日加征关税进行了严正交涉。")
or this
compare(Partial(RatcliffObershelp()), "为人子女者要堂堂正正做人,千万不可作奸犯科,致使父母蒙羞", "此前稍早些时候**商务部发布消息称,中美经贸高级别磋商双方牵头人通话,中方就美拟9月1日加征关税进行了严正交涉。")
throw a StringIndexError. At the same time, TokenSort, TokenSet, or just plain RatcliffObershelp work just fine. Some relic from the times when StringDistances worked only with ASCII strings, perhaps?
using StringDistances
a = "qxKH"
b = collect(a); b[2]=rand('a':'z'); b = join(b);
DamerauLevenshtein()(a,b)
#> 1
DamerauLevenshtein()(a,b, 1.1)
#>2.1
DamerauLevenshtein()(a,b, 2)
#>1
Here, DL(a, b, max_dist) should be 1 for any max_dist > 1. But it seems for 1 < max_dist < 2 we get the wrong answer.
Running on Julia 1.5.4
StringDistances 0.11.0
When I try to precompile my project or build a SysImage that includes StringDistances, I get this warning:
WARNING: Method definition (::Type{StringDistances.Normalized{T} where T<:Union{StringDistances.StringMetric, StringDistances.StringSemiMetric}})(Union{StringDistances.StringMetric, StringDistances.StringSemiMetric}) in module StringDistances at /home/ubuntu/.julia/packages/StringDistances/LNzpw/src/normalize.jl:19 overwritten at /home/ubuntu/.julia/packages/StringDistances/LNzpw/src/normalize.jl:21.
** incremental compilation may be fatally broken for this module **
The lines in question are:
18. struct Normalized{T <: Union{StringSemiMetric, StringMetric}} <: StringSemiMetric
19. dist::T
20. end
21. Normalized(dist::Union{StringSemiMetric, StringMetric}) = Normalized{typeof(dist)}(dist)
I believe in Julia 1.5, the constructor on line 21 is identical to the default constructor.
...but sometimes shorter substrings can give better matches:
julia> using StringDistances #v0.10
julia> a = "giant golden crowned flying fox"
"giant golden crowned flying fox"
julia> b = "A giantgolden crownedflying fox slept all day."
"A giantgolden crownedflying fox slept all day."
julia> Partial(DamerauLevenshtein())(a,b)
4
julia> DamerauLevenshtein()(a,b[3:31])
2
julia> b[3:31]
"giantgolden crownedflying fox"
Not really sure whether this counts as a bug, but I think it should at least be documented. If there were a performant way to match shorter subsequences too, I think that would be ideal.
The tag name "0.3.1" is not of the appropriate SemVer form (vX.Y.Z).
cc: @matthieugomez
Hi, would you mind tagging a new release -- even though not that much has happened since 0.3.0 it'd be nice to have the couple of bugfixes :)
I'd like to use result_type in a package that's supposed to work on arbitrary subtypes of PreMetric, but it isn't defined for any of the string distances here. It should be relatively straightforward to add; for reference: Distances.jl.
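The idea behind result_type can be illustrated with a self-contained sketch. The names here (MyPreMetric, MyEdit, my_result_type, my_pairwise) are invented for illustration and are not the actual Distances.jl or StringDistances.jl API; the concrete return types are guesses of what would be natural.

```julia
# Hypothetical sketch: each distance declares the element type of its
# results so that generic code can preallocate output arrays.
abstract type MyPreMetric end

struct MyEdit <: MyPreMetric end             # counts edit operations -> Int
struct MyNormalized <: MyPreMetric end       # values in [0, 1] -> Float64
(::MyEdit)(a, b) = abs(length(a) - length(b))  # stand-in "distance"

my_result_type(::MyEdit, a, b) = Int
my_result_type(::MyNormalized, a, b) = Float64

# Generic code (like pairwise) can now preallocate a correctly typed matrix:
function my_pairwise(dist::MyPreMetric, xs, ys)
    D = Matrix{my_result_type(dist, xs, ys)}(undef, length(xs), length(ys))
    for i in eachindex(xs), j in eachindex(ys)
        D[i, j] = dist(xs[i], ys[j])
    end
    return D
end
```

In Distances.jl the corresponding trait is Distances.result_type(dist, a, b), which is what generic PreMetric code dispatches on.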
In lines 150-151 of fuzzywuzzy.jl, tsor and tser should be ptsor and ptser, respectively.
Hi Matthieu,
Thanks for the awesome package! I was wondering if it would be possible to stop the comparison when a threshold is met? For example, if I'm only interested in filtering by strings where the result of compare is > 0.8, I'm wondering if it would be possible to know earlier whether or not two strings are going to exceed the threshold, and then bail out of the comparison?
See this thread on discourse for background.
Basically, I'm doing comparisons on arrays that contain hundreds of thousands of strings, so I'm looking for ways to optimize the comparison so that it doesn't take so long.
On Julia v1.7, Jaro() doesn't work in the Mac M1 build.
Computing distances with one of the inputs as an array (of anything) does not result in an error, but instead gives a nonsensical result.
For example,
using StringDistances
d = Levenshtein()
d("minimal working example", ["an array", " of strings"])
returns 23, which is the length of the first argument, not the distance between the first argument and any of the entries in the second argument.
The result is identical (the length of the first argument, at least for Levenshtein) if the array is empty or contains non-string entries.
The indexing into Unicode strings is faulty:
julia> compare(TokenMax(RatcliffObershelp()), "aüa", "aua")
ERROR: StringIndexError("aüa", 3)
Stacktrace:
[1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
[2] Type at ./strings/substring.jl:32 [inlined]
[3] Type at ./strings/substring.jl:38 [inlined]
[4] SubString(::SubString{String}, ::Int64, ::Int64) at ./strings/substring.jl:44
[5] matching_blocks!(::Set{Tuple{Int64,Int64,Int64}}, ::SubString{String}, ::SubString{String}, ::Int64, ::Int64) at /Users/alex/.julia/packages/StringDistances/PGDwX/src/distances/RatcliffObershelp.jl:35
[6] matching_blocks!(::Set{Tuple{Int64,Int64,Int64}}, ::String, ::String, ::Int64, ::Int64) at /Users/alex/.julia/packages/StringDistances/PGDwX/src/distances/RatcliffObershelp.jl:41
[7] matching_blocks at /Users/alex/.julia/packages/StringDistances/PGDwX/src/distances/RatcliffObershelp.jl:48 [inlined]
[8] evaluate(::RatcliffObershelp, ::String, ::String) at /Users/alex/.julia/packages/StringDistances/PGDwX/src/distances/RatcliffObershelp.jl:56
[9] compare at /Users/alex/.julia/packages/StringDistances/PGDwX/src/compare.jl:9 [inlined]
[10] compare(::TokenMax{RatcliffObershelp}, ::String, ::String) at /Users/alex/.julia/packages/StringDistances/PGDwX/src/compare.jl:181
[11] top-level scope at none:0
When I was working with the DamerauLevenshtein evaluate function, I noticed that it allocates a lot. The problem is with these lines: https://github.com/matthieugomez/StringDistances.jl/blob/master/src/edit.jl#L146-L147
It doesn't look like much when you compare only two words, but in my particular case I needed to compare thousands of words in a single run, and these allocations became a real problem. I rewrote the algorithm to reuse two mutable vectors between calculations and saw an almost 80% drop in allocations and a 10%-20% increase in speed. If it is of interest, I can make a PR adding mutating versions of this and the Levenshtein evaluate functions.
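The buffer-reuse idea can be sketched as follows. This is a hypothetical, simplified Levenshtein with preallocated row buffers, not the package's actual implementation; BufferedLevenshtein is an invented name.

```julia
# Hypothetical sketch: keep two work vectors alive across calls so the
# dynamic-programming rows are not reallocated on every distance computation.
struct BufferedLevenshtein
    prev::Vector{Int}
    curr::Vector{Int}
end
BufferedLevenshtein(maxlen::Int) =
    BufferedLevenshtein(zeros(Int, maxlen + 1), zeros(Int, maxlen + 1))

function (d::BufferedLevenshtein)(s1, s2)
    a, b = collect(s1), collect(s2)   # work on character vectors
    m, n = length(a), length(b)
    @assert n + 1 <= length(d.prev) "buffer too small for this string"
    prev, curr = d.prev, d.curr
    for j in 0:n                      # row 0: distance from the empty prefix
        prev[j + 1] = j
    end
    for i in 1:m
        curr[1] = i
        for j in 1:n
            cost = a[i] == b[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,  # deletion
                              curr[j] + 1,      # insertion
                              prev[j] + cost)   # substitution / match
        end
        prev, curr = curr, prev       # swap local bindings, no allocation
    end
    return prev[n + 1]
end
```

Each call overwrites the same two vectors, so repeated comparisons against thousands of words allocate only the temporary character vectors.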
@JuliaRegistrator register()
In real-world cases it's common to find string mismatches due to words that sound the same but are written differently, like meet-meat, one-won, for-four, ate-eight, and many others.
In the following example, the pairwise function from Distances.jl does not work.
Versions:
* Julia: 1.5.1
* Distances: 0.9.2
* StringDistances: 0.8.0
julia> using Distances, StringDistances
julia> pairwise(Jaro(), ["hello" "world"])
ERROR: MethodError: result_type(::Jaro, ::Array{String,2}, ::Array{String,2}) is ambiguous. Candidates:
result_type(dist::PreMetric, a::AbstractArray, b::AbstractArray) in Distances at /home/simon/.julia/packages/Distances/1CdE2/src/generic.jl:36
result_type(dist::Union{DamerauLevenshtein, Jaro, Levenshtein, RatcliffObershelp, StringDistances.QGramDistance, StringDistances.Normalize, Partial, TokenMax, TokenSet, TokenSort, Winkler}, s1, s2) in StringDistances at /home/simon/.julia/packages/StringDistances/67cTd/src/StringDistances.jl:12
Possible fix, define
result_type(::Union{DamerauLevenshtein, Jaro, Levenshtein, RatcliffObershelp, StringDistances.QGramDistance, StringDistances.Normalize, Partial, TokenMax, TokenSet, TokenSort, Winkler}, ::AbstractArray, ::AbstractArray)
Stacktrace:
[1] pairwise(::Jaro, ::Array{String,2}; dims::Nothing) at /home/simon/.julia/packages/Distances/1CdE2/src/generic.jl:212
[2] pairwise(::Jaro, ::Array{String,2}) at /home/simon/.julia/packages/Distances/1CdE2/src/generic.jl:209
[3] top-level scope at REPL[7]:1
I suspect that the reason for this ambiguity is making StringDistance a union; see #19
I was wondering whether this library uses the bit parallel speed up tricks from:
https://www.win.tue.nl/~jfg/educ/bit.mat.pdf
https://users.dcc.uchile.cl/~gnavarro/ps/jea06.pdf
They seem very worthwhile.
Would it be possible to use this package to operate on vectors, e.g.,
evaluate(Levenshtein(), [1,2,3], [1,2,10]) == 1 # They differ in one element
?
I see that you have switched to a dictionary-based count iterator. But how can it be safe to use UInt8s to count the number of occurrences of a q-gram? Since UInt8(255) + UInt8(1) == UInt8(0), this would mean that if one of the strings has more than 255 occurrences of one and the same q-gram, all of the QGram distances will give the wrong result.
Maybe it would be better to use the max number of q-grams in the strings (s1 and s2) to select how many bits of the UInt to use?
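The suggested selection could be sketched like this (a hypothetical helper, not package code): since a string of length n contains at most n - q + 1 occurrences of any single q-gram, that bound can pick the narrowest counter type that cannot overflow.

```julia
# Hypothetical sketch: choose the smallest unsigned integer type that can
# hold the maximum possible count of any single q-gram in either string.
function counter_type(s1, s2, q::Int)
    n = max(length(s1), length(s2)) - q + 1  # upper bound on any q-gram count
    n <= Int(typemax(UInt8))  && return UInt8
    n <= Int(typemax(UInt16)) && return UInt16
    n <= Int(typemax(UInt32)) && return UInt32
    return UInt64
end
```

UInt8 then remains the common fast path for short strings, while very long strings transparently get a wider counter.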