Comments (4)
Note that it goes even deeper, since every q-gram distance except QGram returns NaN when both inputs are shorter than q:
julia> filter(d -> isnan(d(2)("", "")), [QGram, Cosine, Jaccard, Overlap, SorensenDice, MorisitaOverlap, NMD])
6-element Vector{DataType}:
 Cosine
 Jaccard
 Overlap
 SorensenDice
 MorisitaOverlap
 NMD

julia> QGram(1)("", "")
0

julia> QGram(2)("a", "b")
0
from stringdistances.jl.
I am not sure there is an issue with the current implementation. The way I think about it, each q-gram distance is defined by a formula (given in the docs), and that formula still applies when the set of q-grams is empty. In some distances the number of q-grams appears in the denominator, which is why those distances return NaN when the set of q-grams is empty.
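To make the point concrete, here is a worked sketch of two cases. It assumes the usual definitions (Jaccard as one minus the ratio of shared to total distinct q-grams, QGram as the L1 distance between q-gram count vectors); the notation $Q_q(s)$ for the set of q-grams of $s$ and $c_s(g)$ for the count of q-gram $g$ in $s$ is introduced here for illustration and may differ from the docs.

```latex
% Jaccard: the size of the union is the denominator.
d_{\mathrm{Jaccard}}(s_1, s_2)
  = 1 - \frac{|Q_q(s_1) \cap Q_q(s_2)|}{|Q_q(s_1) \cup Q_q(s_2)|}

% If both strings are shorter than q, then Q_q(s_1) = Q_q(s_2) = \emptyset:
d_{\mathrm{Jaccard}}(s_1, s_2) = 1 - \frac{0}{0}
  \quad\text{(undefined; NaN in floating-point arithmetic)}

% QGram: a sum with no denominator, so the empty case is an empty sum.
d_{\mathrm{QGram}}(s_1, s_2)
  = \sum_{g} \left| c_{s_1}(g) - c_{s_2}(g) \right| = 0
```

Under these assumptions, QGram is the only distance in the list that stays finite for two empty q-gram sets, which matches the REPL output above.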
from stringdistances.jl.
Mathematically I agree.
I see dangers in actual use, but OK, people will have to handle it themselves. A simple solution might be to point out somewhere in the documentation that one can add a safe evaluate method like so:
julia> function safeevaluate(D::Union{Cosine, Overlap, MorisitaOverlap}, s1, s2)
           length(s1) >= D.q && length(s2) >= D.q && return evaluate(D, s1, s2)
           throw(ArgumentError("An argument is shorter than q ($(D.q)): \"$s1\", \"$s2\""))
       end
safeevaluate (generic function with 1 method)

julia> D = Cosine(2)
Cosine(2)

julia> @assert safeevaluate(D, "aa", "bb") == evaluate(D, "aa", "bb")

julia> safeevaluate(D, "", "bb")
ERROR: ArgumentError: An argument is shorter than q (2): "", "bb"
Stacktrace:
 [1] safeevaluate(D::Cosine, s1::String, s2::String)
   @ Main ./REPL[2]:3
 [2] top-level scope
   @ REPL[5]:1

julia> function safeevaluate(D::Union{Jaccard, SorensenDice, NMD}, s1, s2)
           (length(s1) >= D.q || length(s2) >= D.q) && return evaluate(D, s1, s2)
           throw(ArgumentError("An argument is shorter than q ($(D.q)): \"$s1\", \"$s2\""))
       end
safeevaluate (generic function with 2 methods)

julia> safeevaluate(Jaccard(2), "", "")
ERROR: ArgumentError: An argument is shorter than q (2): "", ""
Stacktrace:
 [1] safeevaluate(D::Jaccard, s1::String, s2::String)
   @ Main ./REPL[11]:3
 [2] top-level scope
   @ REPL[12]:1
from stringdistances.jl.
Sure. I'm also open to changing things to follow what other libraries typically do in this case. I will leave this issue open.
from stringdistances.jl.