buybackoff / 1brc Goto Github PK

View Code? Open in Web Editor NEW

417.0 11.0 42.0 363 KB

1BRC in .NET among fastest on Linux

Home Page: https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-fastest-on-linux-my-optimization-journey/

License: MIT License

C# 99.30% Shell 0.70%

1brc dotnet performance simd

1brc's Introduction

.NET Solution for One Billion Row Challenge

My blog post: https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-fastest-on-linux-my-optimization-journey/

Results

See results in the blog post.

A separate repository for automated benchmarks: https://github.com/buybackoff/1brc-bench.

Build & Run on Linux

To install .NET on Linux, follow official instructions.

To build, run ./build.sh.

To run JIT version: ./jit.sh /path/to/measurements.txt.

To run AOT version: ./aot.sh /path/to/measurements.txt.

1brc's People

Contributors

Stargazers

Watchers

Forkers

huangskar charmis mraipsec-mra chapter02 unitycoder barracoder dadhi jianyuyanyu rubenwe hrouidi genoher denisenko93 devbox10 maybelaterornot michelz glennwiz zyxyzk to11mtm thenasim narsis77 thelonelyvulpes mikigit11 theretopwq karthikkashyap ole-e-ole edenore henricch gradassa davidpetric a-stankevich amirulandalib bluetianx carsonpan hixixtech annngellalala dexcompiler gambitkz ggais wofll-forks morningcigarette xhanb cjvandyk

1brc's Issues

Last benchmark for kpocza's repo

I've made some improvements to my implementation (https://github.com/kpocza/1brc) mostly by decreasing hash collisions, exploiting cache locality.
I expect it to run 10-20% faster (especially for the 10k tests), but far slower than the best competitors.
Please rerun you benchmarks.

New C++ version 3x faster on 10k key dataset

https://github.com/lehuyduc/1brc-simd

Hi, I've updated my code to optimize for the 10k keys dataset. On my PC it's ~3x faster (excluding munmap time) than the commit you tested. Default dataset performance is a bit slower.

Just ./run_cpp.sh to compile and run.

To test the effect of hyper threading, you can do ./run_cpp.sh 12 12 (12 == number of threads total on your CPU). You will see interesting effects on the 10K dataset :D

Thanks! Looking forwards to your updated result.

Add implementation from Cameron Aavik

As per your recommendation on twitter, opening an issue here to see if you can add my solution to your .NET results comparison table: https://github.com/CameronAavik/1brc

I'm using Windows but according to the total process time my solution seems to be ~0.87s faster 2.22s -> 1.35s, and according to Stopwatch inside the program it is ~0.11s faster 1.45s -> 1.34s.

measurements

Measure kpocza's implementation

Please measure my implementation (https://github.com/kpocza/1brc) and add to the results table if it performs well on you machine.

Re-run .NET implementation (pedrosakuma)

Victor,

When you get the time, could you re-run with my last set of implementations?

https://github.com/pedrosakuma/1brc

Thanks a lot!

Add noahfalk entry

Hi @buybackoff - thanks for the great article and running this unofficial leaderboard. Would you mind including my entry in your run? https://github.com/noahfalk/1brc/

Its quite speedy from my own benchmarking and I hope the results will be reproducible for you. I put some more info in the README on my repo but if you have any questions about it feel free to reach out. Thanks!

Last refresh

Hello Victor,

I know is very close to the official cut time.
I submitted a PR on 1brc-bench with my repository.

Thanks in advance.

Add my 'Safe' (No Pointer) solution

I have implemented my solution with a self imposed restriction to avoid Unsafe and stay managed only. It supports both \n as well as \r\n.

Some ideas of improvements

I was looking at the code and thought there are some things worth trying to further improve the execution time:

pass Summary by ref to Merge
Summary is 16 bytes and for min/max int is not needed, short should be enough.
.Select(x => (Name: x.Key.ToString(), x.Value)) for final result may not be needed, instead Utf8Span could implement IComparable and do comparison on Span property
if the previous point is implemented OrderBy can be replaced with something like Span<Utf8Span> spans = stackalloc Utf8Span[result.Count]; spans.Sort();
apply [SkipLocalsInit] to Summary since it looks like the Init method is always invoked for a new instance anyway
.ToList() may not be needed before calling Aggregate because Aggregate already has a version for ParallelQuery<TSource>
Console.Write($"{pair.Name} = {pair.Value}"); allocates an unnecessary string, could instead rewrite it as Console.Write(pair.Name); Console.Write(" = "); Console.Write(pair.Value) however not sure what is the performance of actual Console api, maybe the better strategy would be to just form 1 big resulting string and invoke a single Console.Write?

Sum should be long

@buybackoff max measurement is 99.9 or 100, 1B * 100 > int.MaxValue, so for the general case this would result in overflow. Using long probably won't hurt perf, significantly.

Implement ISpanFormattable on Summary

Hi,

I think you can win something by implementing ISpanFormattable on Summary. This will make the string interpolation in Console.Write($"{pair.Name}={pair.Value}") run faster. You avoid the ToString() method.

You can try this using this code. It was faster on my (slow) computer.

public struct Summary : ISpanFormattable {

    public bool TryFormat(Span<char> destination, out int charsWritten, ReadOnlySpan<char> format, IFormatProvider? provider) {
        return destination.TryWrite($"{Min / 10.0:N1}/{Average / 10.0:N1}/{Max / 10.0:N1}", out charsWritten);
    }

    public string ToString(string? format, IFormatProvider? formatProvider) {
        return ToString();
    }

If it works you can even gain more performance by reimplementing the TryFormat method without the use of the TryWrite extensionmethod. It quite a lot of work but might work.

Regards,

Fons

Add entry for dzaima

https://github.com/dzaima/1brc - build with make a.out, run with THREADS_1BRC=[desired thread count] ./a.out path/to/measurements.txt.

Language column is mildly non-trivial - there's C, C++, and Singeli involved (Singeli generating gen.c which is committed to the repo to not require Singeli to build).

I expect it to roughly match lehuyduc's entry on both the original & 10K datasets (on my PC my solution is ~10% faster for 10K, and roughly the same for original).

(the repo also has a Java solution, but it's only optimized for the original dataset, and has extremely high variance (>2x; JIT compilation screwing up? haven't bothered looking into it), and with JIT startup time and worse codegen it struggles even at the best of times, so not worth including)

IndexOfNewlineChar optimization?

Not entirely sure I'm following your code (I can't find where the powers of 10 are actually used, for example).

But something struck me about IndexOfNewlineChar: if only \r\n and \n variants are supported (not the \r-only variant that is a relic of old Macs), you can look only for \n, and if found and idx > 0, then go see if it is preceeded by \r and thus needs a 2-character stride. Something like:

internal static int IndexOfNewlineChar(ReadOnlySpan<byte> span, out int stride)
 {
    stride = default;
    int idx = span.IndexOf((byte)'\n');
    if (idx < 0) return;
    int lastIdx = idx - 1;
    if (idx > 0 && span[lastIdx] == '\r') {
       stride = 2;
       return lastIdx;
    }
    stride = 1;
    return idx;
}

Remeasure kpocza's implementation

I've made some improvements to the original version.
Please rerun your measurements on my implementation (https://github.com/kpocza/1brc) and add to the results table if it performs better. I've prechecked the output and it conforms to the previous version.