
Comments (13)

jamesturk commented on June 11, 2024

Can you provide the benchmark code you are using, and details on the machine? This is not in line with what I'm seeing in my benchmarks, and I'm wondering if it's an apples-to-apples comparison, since the unicode normalization is 90 percent of the runtime.

jamesturk commented on June 11, 2024

One other question if you're able to answer it: some Hamming implementations refuse strings of dissimilar lengths, which is the only way you'd get O(1) performance. Is that what this other library is doing? The old C implementation of hamming is about as concise as can be, so I am really curious what's going on in the original bench.

maxbachmann commented on June 11, 2024

> Can you provide the benchmark code you are using, and details on the machine? This is not in line with what I'm seeing in my benchmarks, and I'm wondering if it's an apples-to-apples comparison, since the unicode normalization is 90 percent of the runtime.

I do not see why you would need any unicode normalization; since you receive unicode strings from Python, this feels like a waste of time. (The full benchmark script is a bit longer, since it also does things like visualization and generation of a random dataset.)

> One other question if you're able to answer it: some Hamming implementations refuse strings of dissimilar lengths, which is the only way you'd get O(1) performance. Is that what this other library is doing? The old C implementation of hamming is about as concise as can be, so I am really curious what's going on in the original bench.

This other library does the same as jellyfish:

  • it accepts strings with any number of differing characters
  • it accepts strings of different lengths and counts the length difference as insertions/deletions

I am really stunned by the performance difference as well. Did the old implementation copy the string for the comparison? In rapidfuzz I always work with a view on the internal representation inside the Python string.

This is a very simple benchmark (I can share the benchmark script mentioned above as well if that helps). These are very long sequences, so the Python call overhead becomes less relevant.

from timeit import timeit

setup = """
from jellyfish import hamming_distance as jellyfish
from rapidfuzz.distance.Hamming import distance as rapidfuzz
a = "a" * 200000
b = "b" * 200000
"""

# worst case (every character differs) and best case (identical strings)
print(timeit("jellyfish(a, b)", setup=setup, number=1000))
print(timeit("jellyfish(a, a)", setup=setup, number=1000))
print(timeit("rapidfuzz(a, b)", setup=setup, number=1000))
print(timeit("rapidfuzz(a, a)", setup=setup, number=1000))

This results in:
old library version (C):

0.386396555997635
0.4072674199996982
0.03347623199806549
0.0331174099992495

new library version (Rust):

4.556943121999211
4.313729999998031
0.033683902001939714
0.03330945900233928

jamesturk commented on June 11, 2024

This is just PyO3/Rust overhead, and the unicode normalization matters because going character by character isn't the right way to handle unicode strings, since some characters span multiple code points.

maxbachmann commented on June 11, 2024

> This is just PyO3/Rust overhead, and the unicode normalization matters because going character by character isn't the right way to handle unicode strings, since some characters span multiple code points.

Hm, out of interest: does Python not perform the same kind of normalization? As in, does

[x for x in my_string]

lead to any incorrect results?
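For context, iterating a Python string yields code points, not user-perceived characters, so two canonically equivalent strings can iterate differently. A minimal illustration using the standard unicodedata module (the NFC form here is just for demonstration):

import unicodedata

s1 = "caf\u00e9"    # "café" with precomposed é (U+00E9)
s2 = "cafe\u0301"   # "café" as 'e' plus combining acute accent (U+0301)

print(list(s1))     # 4 elements: ['c', 'a', 'f', 'é']
print(list(s2))     # 5 elements: the accent is a separate code point
print(s1 == s2)     # False, although both render as "café"
print(unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2))  # True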


maxbachmann commented on June 11, 2024

Hm, I just had a look at the old implementation, which calls unicodedata_normalize as well. Maybe I should do this too; it sounds like it would be worth it in terms of correctness 👍

For most metrics this should not really have a big impact on performance.


maxbachmann commented on June 11, 2024

Hm, especially for ASCII strings (which I benchmarked) this should not have any performance impact, since they are already normalized. So it should be possible to avoid the overhead in this case.
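A sketch of that fast path (maybe_normalize is a hypothetical helper, and NFD is just a stand-in for whichever normalization form the library actually uses):

import unicodedata

def maybe_normalize(s: str) -> str:
    # ASCII strings contain no combining sequences, so normalizing
    # them is a no-op that can be skipped. In CPython str.isascii()
    # is O(1): the flag is stored on the string object itself.
    if s.isascii():
        return s
    return unicodedata.normalize("NFD", s)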


jamesturk commented on June 11, 2024

note to self: Alternate Rust implementation to benchmark

pub fn vec_hamming_distance<T: PartialEq>(s1: &FastVec<T>, s2: &FastVec<T>) -> usize {
    let (longer, shorter) = if s1.len() > s2.len() {
        (s1, s2)
    } else {
        (s2, s1)
    };

    // distance is difference in length + differing chars
    let len_diff = longer.len() - shorter.len();
    let differing_chars = shorter.iter().zip(longer.iter()).map(|(a, b)| (a != b) as usize).sum();

    len_diff + differing_chars
}

Profiling the code shows that a lot of the time is spent there whether it is necessary or not, but I'm fine w/ that in the name of correctness at edge cases (which this library had plenty more of before adding this normalization years ago). We're talking about nearly immeasurable amounts of time in practice, so the correctness/safety is worth it for these use cases.

https://github.com/jamesturk/jellyfish/blob/main/benchmarks/compare.ipynb

The current average difference from the C version is 1.5x in practice. I see that with a 200k-character string you get different results, but I can't think of a realistic use case anywhere near that long 😄


maxbachmann commented on June 11, 2024

> The current average difference from the C version is 1.5x in practice. I see that with a 200k-character string you get different results, but I can't think of a realistic use case anywhere near that long 😄

This was mostly an odd occurrence I noticed when benchmarking my other metrics; my first thought was that my benchmarks had to be broken somehow. I do however know of a user of my library who uses the Levenshtein distance + edit-operation backtracing with sequences of over 100k characters (to compare OCR results of archival documents) 😉
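For reference, that kind of backtracing is exposed in rapidfuzz via editops; a tiny example (short strings just to show the shape of the output):

from rapidfuzz.distance import Levenshtein

ocr = "Arch1ve"      # OCR output with one misread character
truth = "Archive"

print(Levenshtein.distance(ocr, truth))     # 1
for op in Levenshtein.editops(ocr, truth):
    print(op.tag, op.src_pos, op.dest_pos)  # replace 4 4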


jamesturk commented on June 11, 2024

Gotcha. I'll keep this open until I get a chance to prod at the alternate implementation above, and I definitely appreciate you raising it in case there is some low-hanging fruit, but I don't want to micro-optimize and lose clarity/correctness for this one.


maxbachmann commented on June 11, 2024

Yes, I completely understand. This looks pretty damning in the benchmarks, but given the low overall times it is not really that important.

The implementations of Levenshtein and Jaro used in jellyfish are the ones with the largest potential for optimization, but those optimizations are a lot more complex as well 😓

  • levenshtein
  • jaro


jamesturk commented on June 11, 2024

These are interesting for sure; I'll need to add some length-based benchmarks on my end to see how this changes.

A tradeoff I made with the Rust version is the choice of a vector type that can live on the stack or spill to the heap; it's entirely predicated on the idea that most people are passing short strings. I realize I don't know if that's true or not; it's really just my own use cases/intuition that people are usually sending words, not sentences.

Have you heard much from folks sending in strings >30 characters on a regular basis? I'm wondering what use cases I'm missing.


maxbachmann commented on June 11, 2024

> A tradeoff I made with the Rust version is the choice of a vector type that can live on the stack or spill to the heap,

Yes, that's a pretty cool optimization. In C++ the same idea is used in the std::string class as the small string optimization.

> it's entirely predicated on the idea that most people are passing short strings. I realize I don't know if that's true or not; it's really just my own use cases/intuition that people are usually sending words, not sentences.

Most of my users have relatively short strings, but tons of them (as in millions of strings).

> Have you heard much from folks sending in strings >30 characters on a regular basis? I'm wondering what use cases I'm missing.

For Jaro / JaroWinkler I have only seen relatively short sequences used. This is probably because the metric was designed to compare sequences with spelling mistakes.
For the Levenshtein distance I have absolutely seen people using longer strings. The example I was talking about earlier was https://github.com/qurator-spk/dinglehopper; it's used a lot in computational biology as well.

Optimizations

I have tons of different optimizations I do behind the scenes, while trying to provide the user with a relatively simple interface. I provide an API to compare single sequences similar to jellyfish, but also one to compare multiple sequences in parallel. The API comparing single sequences is a lot more limited in possible optimizations.

For these functions I use the following things:

  • I use specialized implementations based on string length + character width. Where available for a given metric these are usually bit-parallel implementations, which allow comparing 64 characters in parallel.
  • I allow the user to provide a score cutoff, which lets me exit early once it is reached (see the example after this list). In addition I provide a score hint, which is used to select an implementation that is slower if the score is lower, but can be significantly faster if the score is at least as good.
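A minimal example of the cutoff behavior using rapidfuzz's single-pair API (for distances, values above the cutoff are reported as score_cutoff + 1):

from rapidfuzz.distance.Levenshtein import distance

print(distance("aaaa", "bbbb"))                  # 4
# With score_cutoff the computation can stop early; distances
# larger than the cutoff are reported as score_cutoff + 1.
print(distance("aaaa", "bbbb", score_cutoff=2))  # 3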

When comparing multiple sequences at once I do the following things:

  • I do not call back into Python for each comparison. Instead I hide a PyCapsule in the function attributes, which allows me to directly call the corresponding C++ function.
  • I cache some parts between string comparisons. E.g. bitvectors for bit-parallel algorithms can often be created ahead of time. Especially for short sequences this saves a lot of time.
  • for some metrics I use SIMD implementations to compare multiple short strings in parallel (e.g. compare 1 string to 32 strings with <= 8 characters using AVX2)
  • the user can enable multithreading (see the example below)
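As an example of the batch API, process.cdist scores every query against every choice in one call (it returns a numpy matrix, so numpy must be installed; workers=-1 uses all cores):

from rapidfuzz import process
from rapidfuzz.distance import Levenshtein

queries = ["jellyfish", "hamming"]
choices = ["jellyfish", "hamming", "levenshtein"]

# One call scores every query against every choice; no Python-level
# callback happens per comparison.
matrix = process.cdist(queries, choices, scorer=Levenshtein.distance, workers=-1)
print(matrix)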

Algorithms

These are the complexities of my implementations of the algorithms used in jellyfish:

algorithm            memory complexity    time complexity
Levenshtein          O(N)                 O([M/64]*N)
Jaro                 O(N)                 O([M/64]*N)
JaroWinkler          O(N)                 O([M/64]*N)
DamerauLevenshtein   O(N)                 O(M*N)

These algorithms have the following complexities in jellyfish:

algorithm            memory complexity    time complexity
Levenshtein          O(N)                 O(M*N)
Jaro                 O(N)                 O(M*N)
JaroWinkler          O(N)                 O(M*N)
DamerauLevenshtein   O(M*N)               O(M*N)
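The [M/64] factor comes from bit-parallelism; for Levenshtein this is Myers' bit-vector algorithm. Below is a plain-Python sketch of the idea for illustration only; the real implementations work on 64-bit machine words (here Python's big integers stand in for the blocked words, so the [M/64] blocking is not shown):

def levenshtein_bitparallel(a: str, b: str) -> int:
    # Myers' bit-vector Levenshtein: the whole DP column for `a` is
    # updated with a constant number of word operations per character
    # of `b`.
    if not a:
        return len(b)
    m = len(a)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    # peq[c] has bit i set iff a[i] == c
    peq = {}
    for i, ch in enumerate(a):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    vp, vn, dist = mask, 0, m   # vertical delta bitvectors, current distance
    for ch in b:
        eq = peq.get(ch, 0)
        d0 = ((((eq & vp) + vp) ^ vp) | eq | vn) & mask
        hp = (vn | ~(d0 | vp)) & mask
        hn = d0 & vp
        if hp & high:
            dist += 1
        if hn & high:
            dist -= 1
        hp = ((hp << 1) | 1) & mask
        hn = (hn << 1) & mask
        vp = (hn | ~(d0 | hp)) & mask
        vn = d0 & hp
    return dist

print(levenshtein_bitparallel("kitten", "sitting"))  # 3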

