turnerj / quickenshtein Goto Github PK

View Code? Open in Web Editor NEW

274.0 5.0 14.0 332 KB

Making the quickest and most memory efficient implementation of Levenshtein Distance with SIMD and Threading support

License: MIT License

C# 100.00%

levenshtein-distance string-distance simd hardware-intrinsics threading edit-distance levenshtein

quickenshtein's Introduction

Hello. I'm Turnerj

aka. James Turner - a programmer and entrepreneur with a love of cars, music and technology

Projects

As a Core Contributor

Support My OSS Work

I haven't scratched the surface of the projects I want to build or work on. Supporting me allows not only the work for these projects to continue, it also helps me to develop other new and interesting projects.

quickenshtein's People

Contributors

Stargazers

Watchers

Forkers

daniel-svensson adixt isachpaz bladezhang mgoodfellow ingted isaac-rabin marcodimarek andreabrantes kolyamolya exelanska pdecembre minthantsin

quickenshtein's Issues

Benchmarking individual pieces of Quickenshtein

As a thought coming from my own tweet about how SSE2 being enabled didn't seem faster than no intrinsics leads me to believe that I need to better measure individual optimizations as well as everything combined.

That is to say, if I improve FillRow, I should have a FillRow benchmark. As these are private methods, I'd probably have them marked as internal and use that configuration thing to allow certain libraries access to it.

Initially, it would be for benchmarking but might even be a good idea at some really targeted testing of these components.

Update BDN Columns

A proper "Speedup" column and a "Worthiness" column would be good additions in comparison to the baselines and Fastenshtein.

See: damageboy/VxSort@75b3e8d

Example

Re-look at Threading

As long as I can have it allocation-free (or if it must have allocations, make threading optional), there are a few things I could do to make it work.

Use the ThreadPool rather than Parallel.For - assuming that it is faster & allocates less.
With the thread boundaries, have arrays spanning the rows allocated by ArrayPool
Use the presence of a non-zero value in a "previous" boundary to know whether it can start

Having a full array spanning the rows at each boundary prevents any weirdness with having to stall the next thread across manually. All threads (except the first thread) just need to check if the left boundary has completed the row it is up to.

I'd still probably have it on a while (true) loop checking values simply because all locking attempts I've seen are either "too slow" (I don't want to wait milliseconds), allocate and/or require disposal.

I'd probably pass pointer references to both to the strings and their boundaries (so up to 4 pointers). Besides that, each thread would only need to know how many rows (total), how many columns (individual) and their individual start column index.

If I understand everything right, I should be able to use SIMD instructions on top of threading which effectively will allow all my optimizations at once.

Usage in .NetStandard2.0 library can cause MethodNotFound exception

Hi,

If you include this package into a .NetStandard2.0 library, and invoke the method signatures of

Levenshtein.GetDistance(string, string) or Levenshtein.GetDistance(string, string, CalculationOptions)

Then call this from a test project (for example), where the project is .net5 / .netcore etc, then when invoking the method, you will get a MissingMethodException

This happens because the methods are only present in netstandard

This can be fixed by using string extensions .AsSpan() to cause the other form included in all outputs to be used instead.

Unless I'm missing something, why are these methods only included if this is a .netstandard library?

Branch Miss Analysis

Similar to how damageboy does in their blog post for sorting, it would be useful to track branch misses etc with the same tool - perf, a program on Linux.

This is how damageboy called it:

$ perf record -F max -e instructions,branches,branch-misses \
    ./Example --type-list DoublePumpOverlined \
              --size-list 100000 --max-loops 1000 --no-check

Does perf work well in WSL? What exactly do I need the example program to do to be compatible with it?

This could be a useful utility to make sure we are optimised with our branching logic and whether there are any gains by optimising this further.

Full-branchless Calculation with SIMD

I've tried 3 times so far to have a fully branchless calculation by leveraging comparisons through SIMD. Unfortunately every attempt I've done has been slower than having the branch. This is likely in part that a branchless option actually was doing more operations (the "min" calls) and I'm still not actually taking advantage of the size of the vector (I'm really just operating on the first item).

First Version: https://gist.github.com/Turnerj/92ef8b18437042c9f15a6b7e932058b7
This version had the clever idea of calculating the substitution costs and deletion costs a chunk at a time. It actually made use of the whole vector by doing this and thus, should have been less operations overall. While I don't have the benchmark handy anymore, I believe it performed worse than Fastenshtein.

Second Version: https://gist.github.com/Turnerj/76b69b10acefa5d9dde99f362b7f13f8
Faster than the first version, this version does a few things differently. Taking advantage of MinHorizontal can allow just a single call to find the lowest value. I don't have the docs in front of me now but I believe I was limited to ushort and while 65K characters is a big string, I'd really like to support 100K+ string sizes. Anyway, this version was about the same speed as Fastenshtein as far as I can recall.

Third Version: https://gist.github.com/Turnerj/c3f1962825f28f2e7798f85d25c3a137
It is much faster than the first or second version but it still is slower than the current master. It is smarter with loading the source characters and isn't limited to ushort like the second attempt.

If I'm going to try a fourth attempt, to make it easier for myself, I really should break the benchmarking up like I've mentioned in #5 as writing all of that in an unrolled codebase was a PITA.

Improving the Diagonal Calculation

A few points:

Diagonal calculation isn't effective with its use of caching vector-sized strings, specifically the target vector which needs to be flipped. It flips it multiple times because of how the data iterates. If the diagonal calculation could be both diagonal within a column the width of the vector, it would eliminate this.
Threading doesn't make use of the diagonal calculation because the boundaries cannot get the correct data easily (both the back boundary for the next thread and the forward boundary for the current thread). Using the opposite of the technique above, instead of diagonal within a column the width of the vector, do diagonal in the row the width of the vector.

Preliminary calculations on performance for single threaded improvement (take these with a grain of salt, they all don't calculate correctly):

No target lookups and no shuffles/permutations aka. Process Column (at best - we would still need one of these calls per vector but we would avoid a lot of duplicates)

Method	Mean	Error	StdDev	Gen 0	Gen 1	Gen 2	Allocated
Quickenshtein	23.67 ms	0.125 ms	0.117 ms	-	-	-	-

No source lookups aka. Process Row (at best - we would still need one of these calls per vector but we would avoid a lot of duplicates)

Method	Mean	Error	StdDev	Gen 0	Gen 1	Gen 2	Allocated
Quickenshtein	24.50 ms	0.143 ms	0.134 ms	-	-	-	40 B

No source or target lookups (INVALID)

Method	Mean	Error	StdDev	Gen 0	Gen 1	Gen 2	Allocated
Quickenshtein	21.20 ms	0.138 ms	0.130 ms	-	-	-	40 B

For comparison, it takes about 26.8 ms to calculate it currently with the regular diagonal calculation. This means at best, single threaded could have a 12% improvement however it is likely to be closer to 8% for the "correct" solution.

No idea about how the threading code would perform though.

Improving Threading Calculation Performance

For example, use the Concurrency Visualizer extension to identify if there are any areas that can be improved.

This is more a curiousity than anything - my own testing so far has seen great gains in multi-threaded workloads however there could still be more room for improvement.

If possible, there likely are big gains by supporting diagonal calculation in a multi-threaded scenario however that would likely require a major re-write.

Add k-threshold support

This puts an upper-bounds on how high the calculation should actually calculate - once this value is hit, the code would stop processing the edit distance. This is useful for determining if a string was within a range of similarity (from equal to k).

Identify "slowdown" with one of the two strings being empty

This "slowdown" is very small - like 10 to 13 nanoseconds however it is more the big difference between it and Fastenshtein for the same fundamental logic.

Fastenshtein

Blank source and target (2.2 ns)
10 character source and blank target (2.2 ns)
Blank source and 10 character target (15.8 ns) (This path doesn't apply shortcuts)

Quickenshtein

Blank source and target (13.6 ns)
10 character source and blank target (13.6 ns)
Blank source and 10 character target (13.6 ns)

In each one for Quickenshtein, I'm only doing one or two if-statements and returning a result. My theory is it might be a code-size/CPU cache issue.

The reason to even identify this isn't so much about saving 11 nanoseconds on the same operation, it is understand why there would be a difference in the first place.

Improving Instruction Level Dependency

See example: https://www.youtube.com/watch?v=DJVTtIjwZHE

It looks like there could be a bunch of spots where this type of optimization might be able to help a great deal.

feature: edit script

hello,

is there a chance you consider making a way to obtain an edit script (actual insertions/deletions/substitutions) for the shortest distance? (actual insertions/deletions/substitutions)

thanks,

On .NET framework 4.8 this library is slower than super simple 1 file equivalent. Is that normal?

Hi, I've been running a benchmark and found that it's slower than single file <100 lines Fastenshtein algo

Is that because I'm NOT on the .NET CORE framework?
You algo, doesn't allocate memory, that's awesome, but still slower !¿?
My test compare rather small strings < 20 characters.

Other implementation
https://github.com/DanHarltey/Fastenshtein/blob/master/src/Fastenshtein/Levenshtein.cs

BenchmarkDotNet=v0.13.2, OS=Windows 10 (10.0.19043.2130/21H1/May2021Update)
AMD Ryzen 9 5900X, 1 CPU, 24 logical and 12 physical cores
[Host] : .NET Framework 4.8 (4.8.4515.0), X86 LegacyJIT
DefaultJob : .NET Framework 4.8 (4.8.4515.0), X86 LegacyJIT

Method	N	Mean	Error	StdDev	Gen0	Allocated
Fastenshtein	100	598.5 ns	1.67 ns	1.40 ns	0.3443	573 B
Quickenshtein	100	901.5 ns	1.26 ns	1.05 ns	-	-