wsm2110 / faster.map
Collection of incredibly fast hashmaps
License: MIT License
I was looking into the code and I am not sure you are handling hash collisions.
Inspiration can be found here: https://stackoverflow.com/questions/32027271/generate-two-different-strings-with-the-same-hashcode
It would be good to expose a method
public ref TValue GetOrAddValueRef(TKey key)
(like in DictionarySlim).
That would help when TValue is a big struct, and it would also avoid double lookups in quite common scenarios like:
if (map.Get(key, out var oldValue)) // 1st
map.Update(key, CalculateNewValue(oldValue)); //2nd
else
map.Emplace(key, default); //2nd
vs.
ref var value = ref map.GetOrAddValueRef(key);
value = CalculateNewValue(value);
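For what it's worth, .NET 6 ships this exact single-lookup pattern for the built-in Dictionary as CollectionsMarshal.GetValueRefOrAddDefault; a GetOrAddValueRef on the map could follow the same contract. A small self-contained sketch of the usage pattern:

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

class GetOrAddRefDemo
{
    static void Main()
    {
        var counts = new Dictionary<string, int>();
        foreach (var word in new[] { "a", "b", "a" })
        {
            // One lookup: returns a ref to the existing value, or adds
            // default(TValue) and returns a ref to the freshly added slot.
            ref int count = ref CollectionsMarshal.GetValueRefOrAddDefault(counts, word, out _);
            count++;
        }
        Console.WriteLine(counts["a"]); // 2
        Console.WriteLine(counts["b"]); // 1
    }
}
```

The out bool exists parameter (discarded above) tells the caller whether the key was already present, which covers the Get/Update/Emplace branching from the example with a single probe.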
Minimum working example:
var map = new MultiMap<int, int>();
for (int i = 0; i < 1000; i++)
{
// Console.WriteLine(i);
map.Emplace(2, i); // line 86
}
The piece of code above throws the following error on my machine on the 914th iteration:
Error message:
System.IndexOutOfRangeException : Index was outside the bounds of the array.
Stacktrace:
at <snip>.Tests.MultimapTest() in <snip>\Tests.cs:line 86
Similar problems also happen if you replace the 2 with other numbers, new object(), etc.
System information:
.NET 6.0.400
Target Framework: net6.0 (bin), netstandard2.0 (lib)
System: Windows 11 22H2 x64
These changes would be a great addition:
Hey,
I have noticed that we can easily address this issue. Currently we only use two x86 SSE2 intrinsics: Sse2.CompareEqual and Sse2.MoveMask. Those two can easily be replaced with platform-independent equivalents: the static method Vector128.Equals and the extension method Vector128.ExtractMostSignificantBits.
At run time the JIT will lower them to SSE2 instructions, or to their ARM analogues, depending on which architecture the app is running on.
So basically code like this
Faster.Map/src/DenseMapSIMD.cs
Lines 231 to 235 in 7e1fb1e
int result = (int)Vector128.Equals(left, right).ExtractMostSignificantBits();
I've tried replacing the SSE2 calls with Vector128.Equals and Vector128.ExtractMostSignificantBits, and it started working on my brother's MacBook Air with an M1 processor, without x86 emulation.
To make it work, we also need to change this check
Faster.Map/src/DenseMapSIMD.cs
Lines 153 to 156 in 7e1fb1e
if (!Vector128.IsHardwareAccelerated)
{
throw new NotSupportedException("Simd is not supported");
}
Isn't that cool or what?
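For anyone curious, here is a minimal self-contained sketch of the platform-independent pattern on .NET 7+ (not the library's exact code; the helper name MatchMask is mine):

```csharp
using System;
using System.Runtime.Intrinsics;

class PortableSimdMatch
{
    // Compares 16 metadata bytes against a target and returns a bitmask
    // with one bit per equal lane. On x86 the JIT lowers this to
    // pcmpeqb + pmovmskb (i.e. Sse2.CompareEqual / Sse2.MoveMask);
    // on ARM it emits the AdvSimd equivalents.
    static uint MatchMask(sbyte[] metadata, sbyte target)
    {
        var left = Vector128.Create(target);
        var right = Vector128.LoadUnsafe(ref metadata[0]);
        return Vector128.Equals(left, right).ExtractMostSignificantBits();
    }

    static void Main()
    {
        var group = new sbyte[16];
        group[3] = 42;
        group[10] = 42;
        Console.WriteLine(MatchMask(group, 42)); // bits 3 and 10 set: 1032
    }
}
```

The Vector128 APIs also have scalar software fallbacks, so the NotSupportedException guard can be relaxed to a performance warning rather than a hard failure.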
Not sure why this is happening.
Minimum working example:
[TestMethod]
public void AnItemHasBeenLost()
{
List<long> keysList = new List<long>()
{
3053810, 6107620, 9161430, 12215240, 39699530, 36645720, 33591910, 30538100,
27484290, 109937160, 106883350, 103829540, 100775730, 97721920, 82452870, 79399060, 76345250, 73291440, 70237630,
198497650, 195443840, 192390030, 189336220, 186282410, 171013360, 167959550, 164905740, 161851930, 158798120, 128260020,
125206210, 122152400, 119098590, 116044780, 94668110, 91614300, 88560490, 296219570, 293165760, 290111950, 287058140,
284004330, 268735280, 265681470, 262627660, 259573850, 256520040, 225981940, 222928130, 219874320, 216820510, 213766700,
207659080, 204605270, 201551460, 180174790, 177120980, 174067170, 137421450, 134367640, 131313830, 433641020, 430587210,
427533400, 424479590, 421425780, 406156730, 403102920, 400049110, 396995300, 393941490, 363403390, 360349580, 357295770,
354241960, 351188150, 345080530, 342026720, 338972910, 335919100, 332865290, 329811480, 326757670, 323703860, 317596240,
314542430, 311488620, 308434810, 305381000, 302327190, 299273380, 274842900, 271789090, 253466230, 247358610, 244304800,
241250990, 238197180, 235143370, 232089560, 229035750, 210712890, 149636690, 146582880, 143529070, 140475260, 601600570,
598546760, 595492950, 592439140, 589385330, 574116280, 571062470, 568008660, 564954850, 561901040, 531362940, 528309130,
525255320, 522201510, 519147700, 513040080, 509986270, 506932460, 503878650, 500824840, 497771030, 494717220, 491663410,
485555790, 482501980, 479448170, 476394360, 473340550, 470286740, 467232930, 464179120, 461125310, 458071500, 455017690,
451963880, 442802450, 439748640, 436694830, 415318160, 412264350, 409210540, 390887680, 387833870, 384780060, 381726250,
378672440, 375618630, 372564820, 369511010, 366457200, 348134340, 277896710, 250412420, 781775360, 778721550, 775667740,
772613930, 769560120, 754291070, 751237260, 748183450, 745129640, 742075830, 711537730, 708483920, 705430110, 702376300,
699322490, 693214870, 690161060, 687107250, 684053440, 680999630, 677945820, 674892010, 671838200, 665730580, 662676770,
659622960, 656569150, 653515340, 650461530, 647407720, 644353910, 641300100, 638246290, 635192480, 632138670, 622977240,
619923430, 616869620, 613815810, 610762000, 607708190, 604654380, 586331520, 583277710, 580223900, 577170090, 558847230,
555793420, 552739610, 549685800, 546631990, 543578180, 540524370, 537470560, 534416750, 516093890, 488609600, 448910070,
445856260, 418371970, 320650050, 280950520, 183228600, 992488250, 989434440, 986380630, 983326820, 980273010, 965003960,
961950150, 958896340, 955842530
};
long missingKey = 439748640L;
Assert.IsTrue(keysList.Contains(missingKey), "Sanity check failed");
var keysSet = new DenseMapSIMD<long, long>(190);
keysSet.Emplace(0L, 0L);
foreach (var key in keysList)
{
if (!keysSet.Contains(key))
keysSet.Emplace(key, key);
if (key == missingKey)
Assert.IsTrue(keysSet.Contains(missingKey), "Sanity check failed");
}
Assert.IsTrue(keysSet.Contains(missingKey), "Entry has been lost");
}
DenseMapSIMD throws IndexOutOfRangeException inside Contains when a small length is passed into the constructor.
Minimum working example:
var fmap = new DenseMapSIMD<long, long>(1);
fmap.Emplace(0L, 0L);
var r = fmap.Contains(1);
The piece of code above throws IndexOutOfRangeException inside GetArrayEntryRef. Most likely the reason is length instead of _length here:
Faster.Map/src/DenseMapSIMD.cs
Lines 174 to 177 in 27ede7a
Hi,
Do you have any rough idea how the Contains method of these collections compares to those of Dictionary and DictionarySlim in terms of performance? When you get some free time, could you add benchmarks for the cases when the key is present and when it is not present in the collection?
Thank you for the great work!
Maybe we should consider changing this line inside Emplace()
Faster.Map/src/DenseMapSIMD.cs
Line 244 in 7e1fb1e
ref readonly var entry = ref GetArrayValRef(_entries, indexAndOffset);
and this line inside Get()
Faster.Map/src/DenseMapSIMD.cs
Line 351 in 7e1fb1e
ref readonly var entry = ref GetArrayValRef(_entries, index + Unsafe.As<int, uint>(ref offset));
This way we can avoid copying large entries (when we have a large TKey and/or TValue) out of the _entries array. What do you think?
In the Copy method below:
Faster.Map/src/DenseMapSIMD.cs
Lines 608 to 620 in 99a1119
shouldn't this if statement check for denseMap._metadata[i] < 0 rather than _metadata[i] < 0, as it does now?
Faster.Map/src/DenseMapSIMD.cs
Lines 612 to 615 in 99a1119
Hi,
I noticed that jump_distances is read-only and never modified. Is it safe to make it static, so that it is not allocated per instance?
Faster.Map/src/DenseMapSIMD.cs
Lines 131 to 141 in 99a1119
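To illustrate the suggestion, a minimal sketch (the table values here are illustrative, not the library's actual jump distances): a static readonly array is allocated once per type rather than once per instance:

```csharp
using System;

class MapInstance
{
    // static readonly: one shared array for all map instances,
    // instead of one allocation per constructed map.
    // (Values are illustrative; the real table lives in DenseMapSIMD.)
    private static readonly uint[] JumpDistances = { 0, 1, 2, 4, 8, 16 };

    public uint[] Table => JumpDistances;
}

class Program
{
    static void Main()
    {
        var a = new MapInstance();
        var b = new MapInstance();
        // Both instances see the very same array object.
        Console.WriteLine(ReferenceEquals(a.Table, b.Table)); // True
    }
}
```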
Are you sure this code is pipelining friendly? It looks quite the opposite to me. Every line relies on the result from the previous line.
Faster.Map/src/DenseMapSIMD.cs
Lines 812 to 815 in 10ed34a
The generated asm code also doesn't suggest pipelining friendliness.
8BC1 mov eax, ecx
C1E810 shr eax, 16
33C1 xor eax, ecx
69C86BCAEB85 imul ecx, eax, 0xFFFFFFFF85EBCA6B
8BC1 mov eax, ecx ;in order to start the calculations on the second line we need the result from ecx (previous line)
C1E80D shr eax, 13
33C1 xor eax, ecx
69C835AEB2C2 imul ecx, eax, 0xFFFFFFFFC2B2AE35
8BC1 mov eax, ecx ;here again we need the result from the previous line in order to calculate the return value
C1E810 shr eax, 16
33C1 xor eax, ecx
Am I missing something?
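For context, the assembly above is the classic MurmurHash3 32-bit finalizer (the same shift/multiply constants); a C# rendering makes the serial dependency chain obvious, since each xor-shift feeds the next multiply:

```csharp
using System;

class HashFinalizer
{
    // MurmurHash3 fmix32: every step consumes the previous step's result,
    // so the pipeline cannot overlap these operations. The trade-off is
    // a strong avalanche effect: each input bit affects every output bit.
    static uint Fmix32(uint h)
    {
        h ^= h >> 16;
        h *= 0x85EBCA6B;
        h ^= h >> 13;
        h *= 0xC2B2AE35;
        h ^= h >> 16;
        return h;
    }

    static void Main()
    {
        Console.WriteLine(Fmix32(0));                  // 0: all-zero input stays zero
        Console.WriteLine(Fmix32(123) == Fmix32(123)); // True: deterministic
    }
}
```

So the serial chain is inherent to this mixer, not an artifact of the C# code; a "pipelining friendly" variant would need a different hash function, not a rearrangement of these lines.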
According to your documentation, the Emplace method should return false if the key already exists:
Faster.Map/src/DenseMapSIMD.cs
Lines 211 to 218 in 7c07656
Faster.Map/src/DenseMapSIMD.cs
Lines 255 to 258 in 7c07656
As you may know, unlike C++, in C# there is by default a runtime check whenever you access an element at an index that might be outside the bounds of an array. In some cases the JIT compiler is "smart" enough to recognize patterns that guarantee the index can never be out of bounds.
One such pattern is
for (var i=0; i<array.Length; i++)
{
}
In that case the JIT does not emit the native code that does the checks.
In our code, though, there are dynamically calculated indexes that can't hint the JIT to remove the checks. If we are 100% sure that we will never read elements outside of the array bounds, we can remove the checks manually.
Here's what the generated native code looks like when you have checks
private int GetArrayValTestWithChecks(int[] array, int index)
{
return array[index];
}
Test.GetArrayValTestWithChecks(Int32[], Int32)
L0000: sub rsp, 0x28
L0004: cmp r8d, [rdx+8]
L0008: jae short L0016 ;when out of bounds we branch here to the throw line
L000a: mov eax, r8d
L000d: mov eax, [rdx+rax*4+0x10]
L0011: add rsp, 0x28
L0015: ret
L0016: call 0x00007fff38258b30 ;throws the exception
L001b: int3
Here's the case when the checks are eliminated
private int GetArrayValTestWithNoChecks(int[] array, int index)
{
ref var arr0 = ref MemoryMarshal.GetArrayDataReference(array);
return Unsafe.Add(ref arr0, index);
}
Test.GetArrayValTestWithNoChecks(Int32[], Int32)
L0000: cmp [rdx], dl
L0002: movsxd rax, r8d
L0005: mov eax, [rdx+rax*4+0x10]
L0009: ret
Fewer instructions to execute and no branches. The fewer branches you have in the code, the fewer wrong branch predictions and the more consistent the execution time.
So I created these two helper methods
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static T GetArrayVal<T>(T[] array, uint index)
{
#if DEBUG
return array[index];
#else
ref var arr0 = ref MemoryMarshal.GetArrayDataReference(array);
return Unsafe.Add(ref arr0, index);
#endif
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static ref T GetArrayValRef<T>(T[] array, uint index)
{
#if DEBUG
return ref array[index];
#else
ref var arr0 = ref MemoryMarshal.GetArrayDataReference(array);
return ref Unsafe.Add(ref arr0, index);
#endif
}
and amended these Get() method lines
from this
var right = Vector128.LoadUnsafe(ref _metadata[index], jumpDistance);
to this
var right = Vector128.LoadUnsafe(ref GetArrayValRef(_metadata, index), jumpDistance);
from this
var entry = _entries[index + jumpDistance + offset];
to this
var entry = GetArrayVal(_entries, index + jumpDistance + Unsafe.As<int, uint>(ref offset));
and finally, from this
jumpDistance = jump_distances[jumpDistanceIndex];
to this
jumpDistance = GetArrayVal(jump_distances, jumpDistanceIndex);
On my machine the final result for the GetBenchmark is 5-7% better. It's not significant, but it is some gain. Please note that if there's something wrong with the algorithm in the future (if we introduce a bug) and we try to access elements outside of the bounds of the array, it will throw an exception in DEBUG but will crash in RELEASE.
The downside of this optimization is that you won't know what went wrong in RELEASE if there's a bug. It's up to you whether you want to use it or not.
P.S.
A kind reminder that the current implementation of IndexOf can give you a wrong result, because of the allocation of an uninitialized array with GC.AllocateUninitializedArray. We should either remove it or amend it to check the metadata first before returning the index.
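To illustrate the point, here is a hypothetical, simplified IndexOf over parallel metadata/entries arrays (not the library's actual code) that consults the metadata before ever touching a possibly-uninitialized entry:

```csharp
using System;

class IndexOfSketch
{
    const sbyte EmptyBucket = -127; // same sentinel value as DenseMapSIMD

    // Returns the index of key, or -1. Slots whose metadata is negative
    // (empty/tombstone) are skipped, so uninitialized garbage in the
    // entries array is never read.
    static int IndexOf(sbyte[] metadata, long[] keys, long key)
    {
        for (int i = 0; i < metadata.Length; i++)
        {
            if (metadata[i] >= 0 && keys[i] == key)
                return i;
        }
        return -1;
    }

    static void Main()
    {
        var metadata = new sbyte[] { EmptyBucket, 12, EmptyBucket, 7 };
        var keys = GC.AllocateUninitializedArray<long>(4);
        keys[1] = 100; keys[3] = 200; // slots 0 and 2 keep uninitialized garbage
        Console.WriteLine(IndexOf(metadata, keys, 200)); // 3
        Console.WriteLine(IndexOf(metadata, keys, 999)); // -1
    }
}
```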
I spotted something suspicious while looking at the code.
Faster.Map/src/DenseMapSIMD.cs
Lines 175 to 180 in 80e97b7
I've tested your last commit and it returns wrong results for Get()
Please note that the original code was calling LoadUnsafe with two parameters: a source (ref _metadata[index]) and an offset (jumpDistance):
var right = Vector128.LoadUnsafe(ref _metadata[index], jumpDistance);
In the new code we call it with only one parameter, the source (ref GetArrayValRef(_metadata, index)):
var right = Vector128.LoadUnsafe(ref GetArrayValRef(_metadata, index));
which is the same as
var right = Vector128.LoadUnsafe(ref _metadata[index]);
but without the bounds check.
This is because we add the offset to the index variable later on in the code:
//Calculate jumpDistance
index += GetArrayVal(jump_distances, jumpDistanceIndex);
However, it is important to subtract the previous jumpDistance before adding the new one, as it was in my code:
//subtract last jump distance from the index
index -= jumpDistance;
//set the new jump distance
jumpDistance = GetArrayVal(jump_distances, distanceIndex);
//recalculate index, taking into account the new jump distance
index += jumpDistance;
otherwise on the next loop iteration the code would look something like this (pseudo code):
var right = Vector128.LoadUnsafe(ref _metadata[originalIndex + oldJumpDistance + newJumpDistance]);
And this is not what we want. We want
var right = Vector128.LoadUnsafe(ref _metadata[originalIndex + newJumpDistance]);
Also, can you please explain this part:
Faster.Map/src/DenseMapSIMD.cs
Lines 385 to 390 in cbd10bd
if (index > _length)
{
return false;
}
I read it as "if index > length, start all over again", which looks like an endless loop to me.
Generally, I am interested in your implementation, because I tried to implement a concurrent open-addressing map in the past (and failed).
As I understand it, these are not thread-safe variants?
Btw, how does it compare to DictionarySlim?
Subject says it all ;-)
The method gets caught in an endless while (true) loop ... adding a Resize() call, similar to Emplace etc., solves it ...
[TestMethod]
public void AssertGetOrAddMultipleValues()
{
//assign
var map = new DenseMap<int, int>();
for (int i = 0; i < 20; i++)
{
map.GetOrAdd(i); // caught in endless while (true) loop at index 18
}
}
Aside from Sse2 for x86 platforms, please also put Arm SIMD on your roadmap.
For example AdvSimd.CompareEqual(), ...
Hey,
I have a question about this if block in EmplaceInternal():
Faster.Map/src/DenseMapSIMD.cs
Lines 732 to 736 in 082023c
Currently, the only place EmplaceInternal() is called from is Resize(), where we first resize (double) the arrays and then re-add the items from the old arrays (skipping the tombstones). So I wonder whether, with the arrays doubled in size and the tombstones skipped, it is possible at all to reach a point where we would need to resize again. This Resize() <--> EmplaceInternal() "recursion" kind of bothers me.
I have tested the code with the AddAndResizeBenchmark and, with the data we have, we never get into that if block.
This is a bit bold, but it works nicely for your algorithm. It is up to you whether you want to amend it or not.
I have noticed that after the allocation of the new metadata array, you immediately initialize it with the emptyBucket value (-127). So basically the array is initialized twice: once with zeros (using new) and then with -127. Also, the entries array is never used in the code without first checking the metadata array (well, except for the IndexOf() method, which is for testing purposes, but it can also be amended to do metadata lookups first).
Faster.Map/src/DenseMapSIMD.cs
Lines 762 to 765 in 9c60000
So basically we can safely allocate both arrays without initializing them to zero (you don't rely on those zeros in the code anyway) using GC.AllocateUninitializedArray. I have tested it, and allocating an array of 1_000_000 ints without initialization is about 11 times faster.
_metadata = GC.AllocateUninitializedArray<sbyte>((int)_length + 16);
_entries = GC.AllocateUninitializedArray<Entry<TKey, TValue>>((int)_length + 16);
This also brings a slight improvement to the final AddAndResize benchmark. Let me know what you think!
P.S.
If you don't like that the following line starts with new:
new Span<sbyte>(_metadata).Fill(_emptyBucket);
the following is equivalent:
_metadata.AsSpan().Fill(_emptyBucket);
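A tiny self-contained sketch of the proposed allocation pattern; skipping the implicit zero-fill is safe here precisely because every slot is overwritten with the sentinel right away:

```csharp
using System;

class UninitializedAllocDemo
{
    const sbyte EmptyBucket = -127; // DenseMapSIMD's empty-bucket sentinel

    static void Main()
    {
        // Allocate without the CLR's default zero-initialization,
        // then write the only value the algorithm actually relies on.
        sbyte[] metadata = GC.AllocateUninitializedArray<sbyte>(16);
        metadata.AsSpan().Fill(EmptyBucket);

        Console.WriteLine(metadata[0]);  // -127
        Console.WriteLine(metadata[15]); // -127
    }
}
```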
I noticed that you amended the code to assign the loadFactor to the field
Faster.Map/src/DenseMapSIMD.cs
Lines 180 to 183 in 1de97cb
however, this line still uses the parameter and not the field:
Faster.Map/src/DenseMapSIMD.cs
Line 194 in 1de97cb
Thanks for taking the time to fix it!
As the algorithm is currently implemented, we have two special metadata states: tombstone (-126) and emptyBucket (-127). Both of those are negative numbers. If we don't plan to add other special cases, we can safely remove some SIMD calls.
These lines in EmplaceInternal()
Faster.Map/src/DenseMapSIMD.cs
Lines 717 to 720 in 082023c
//check for empty entries
int result = Sse2.MoveMask(right);
and these in Emplace()
Faster.Map/src/DenseMapSIMD.cs
Lines 264 to 268 in 082023c
//check for tombstones - deleted and empty entries
result = Sse2.MoveMask(right);
The reason for this is that all the sbyte hashes in the metadata are positive numbers (most significant bit cleared) and all the tombstones and empty buckets are negative (most significant bit set). So MoveMask will give us the same result as what the current Sse2.CompareGreaterThan(_emplaceBucketVector, right); and Sse2.CompareEqual(_emptyBucketVector, right); return.
Let me know if you disagree
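A small self-contained demonstration of the claim, using the portable Vector128 API instead of Sse2 (the metadata layout here is illustrative): the sign-bit mask of the raw metadata vector already flags exactly the tombstone and empty slots:

```csharp
using System;
using System.Runtime.Intrinsics;

class SignBitMaskDemo
{
    const sbyte Tombstone = -126;
    const sbyte EmptyBucket = -127;

    static void Main()
    {
        var metadata = new sbyte[16]; // zeros act as stored (non-negative) hashes
        metadata[2] = Tombstone;
        metadata[5] = EmptyBucket;
        metadata[9] = 63; // a stored hash: always non-negative

        var v = Vector128.LoadUnsafe(ref metadata[0]);

        // Sign bits of the raw vector (what Sse2.MoveMask(right) returns):
        // set exactly where a slot is a tombstone or an empty bucket,
        // no CompareEqual/CompareGreaterThan needed.
        uint specialSlots = v.ExtractMostSignificantBits();

        Console.WriteLine(specialSlots); // bits 2 and 5 set: 36
    }
}
```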
I have a suggestion for the code below in the Resize() method:
Faster.Map/src/DenseMapSIMD.cs
Lines 758 to 762 in 883c9fe
var oldEntries = (Entry<TKey, TValue>[])_entries.Clone();
var oldMetadata = (sbyte[])_metadata.Clone();
It looks simpler and is slightly faster. I tried it with an array of 1_000_000 ints, and it was about 30% faster.
Both Array.Copy and Array.Clone make a shallow copy.
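A quick self-contained check of the shallow-copy behavior (with a cast, since Array.Clone returns object):

```csharp
using System;

class CloneDemo
{
    static void Main()
    {
        var entries = new long[] { 1, 2, 3 };

        // Array.Clone allocates a new array and copies the elements
        // (a shallow copy); it returns object, so a cast is required.
        var old = (long[])entries.Clone();

        entries[0] = 99; // mutate the live array, as Resize() would

        Console.WriteLine(old[0]);     // 1: the snapshot is unaffected
        Console.WriteLine(entries[0]); // 99
    }
}
```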
Hi, impressed by the performance of the SIMD map specifically.
Regarding the benchmarks, is it possible to add the allocation columns?
I noticed that the code started to run a bit slower after the _maxDistance optimization was added. I have run two tests, one with the old while (true) loop and one with the new while (jumpDistanceIndex <= _maxDistance) loop. The tests differ from each other only in the ratio of existent to non-existent keys.
900_000 : 100_000 (existent to non-existent keys) ratio
Method | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|
DenseMapSIMD2_Get_WithMaxDistanceCheck | 10.65 ms | 0.030 ms | 0.060 ms | 1.00 | 0.00 |
DenseMapSIMD2_Get_WithoutMaxDistanceCheck | 10.04 ms | 0.167 ms | 0.336 ms | 0.94 | 0.03 |
500_000 : 500_000 (existent to non-existent keys) ratio
Method | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|
DenseMapSIMD2_Get_WithMaxDistanceCheck | 6.707 ms | 0.0984 ms | 0.1987 ms | 1.00 | 0.00 |
DenseMapSIMD2_Get_WithoutMaxDistanceCheck | 6.268 ms | 0.0449 ms | 0.0906 ms | 0.94 | 0.02 |
As you can see, in both cases the old while (true) code works better. I started to wonder why...
After some examination of the code, I realized that this check is not necessary at all, and I think that is partially my fault. I asked you in this issue whether the performance would get worse as the number of non-existent keys grew, and kind of pushed you in the wrong direction.
It turns out that the code was already taking good care of non-existent keys, and the added code complexity was not necessary.
Here in this code snippet
Faster.Map/src/DenseMapSIMD.cs
Lines 374 to 380 in cbd10bd
Am I correct or am I missing something obvious here?