spreads / spreads.native
Blosc (byteshuffle, zstd, lz4, zip), mimalloc and CPU core id (sched_getcpu) native methods for .NET Standard 2.0

Home Page: http://docs.dataspreads.io/spreads/libs/native/api/README.html

License: Mozilla Public License 2.0

Rust 32.14% Batchfile 0.35% C# 45.53% Dockerfile 1.60% Smalltalk 0.01% C 7.96% C++ 12.41%
compression dotnet interop lz4 mimalloc rust zstd

spreads.native's Introduction

Spreads.Core

Spreads.Core contains several high-performance features: buffer pools, optimized binary/interpolation search, collections, threading utils, etc.


The "series and panels" part of the library is undergoing a very slow rewrite

Below is a very old readme

Spreads

Spreads

The name Spreads stands for Series and Panels for Real-time and Exploratory Analysis of Data Streams.

  • Data Streams are unbounded sequences of data items, either recorded or arriving in real-time;
  • Series are navigable ordered data streams of key-value pairs;
  • Panels are series of series or data frames;
  • Exploratory data transformation in C#/F# REPLs;
  • Real-time fast incremental calculations.

Spreads is an ultra-fast library for complex event processing and time series manipulation. It can process tens of millions of items per second per thread, handling historical and real-time data in the same fashion, which allows building and testing analytical systems on historical data and using the same code to process real-time data.

Spreads is a library, not a framework, and can be plugged into existing code bases and used immediately. Even though the primary domain is financial data, Spreads was designed as a generic complex event processing library, with the performance requirement that it must be suitable for ticks and full order log processing. This is probably the largest data stream that cannot be meaningfully sharded: financial instruments are all directly or indirectly correlated, and we need to monitor markets as a whole, whereas user event streams at Google/Facebook and the like can be processed independently.

Performance

The Spreads library is optimized for performance and memory usage. It is several times faster than other open source projects, does not allocate memory for intermediate calculations or windows, and provides real-time incremental calculations with low-latency lock-free synchronization between data producers and consumers. You can run the tests and benchmarks to see the exact numbers.

For regular keys - keys with an equal difference between them (e.g. seconds) - Spreads stores only the first key and the step size, reducing memory usage per <DateTime,T> data item by 8 bytes. A <DateTime,double> data item thus takes only 8 bytes inside a Spreads series instead of 16. The gains of this optimization are not obvious in microbenchmarks with a single series, and one could argue that memory is cheap. However, L1/L2/L3 caches are still small, and saving 50% of memory allows placing twice as much useful data in the caches and avoids needless cache thrashing.
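The idea can be sketched in pseudocode (hypothetical names; the actual Spreads internals differ):

    // Pseudocode sketch: a regular series stores firstKey + step instead of a key per item
    type RegularSeries =
        { firstKey : DateTime   // only the first key is stored
          step     : TimeSpan   // e.g. one second
          values   : double[] } // 8 bytes per item instead of 16

    // keys are reconstructed on demand and take no space per item
    let keyAt rs index = rs.firstKey + TimeSpan.FromTicks(rs.step.Ticks * int64 index)
    let valueAt rs (index:int) = rs.values.[index]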

The Spreads library is written in C# and F# and targets .NET 4.5.1 and .NET Standard 1.6. .NET gives native performance when code is optimized for memory access patterns, which means no functional data structures and minimal allocations. Even though .NET is a managed platform with garbage collection, in a steady state Spreads should not allocate many objects or create GC pressure. .NET properly supports generic value types, and arrays of them are laid out contiguously in memory. Such a layout enables CPUs to prefetch data efficiently, resulting in a great performance boost compared to collections of boxed objects. .NET also makes it trivial to call native methods, and the Spreads.Core project uses SIMD-optimized compression and math libraries written in C.

We haven't compared Spreads performance to that of commercial systems yet (because their costs are atrocious and learning cryptic languages is not necessary). However, the main benchmark while developing Spreads was the capabilities of modern CPUs, not any existing product. We tried to achieve mechanical sympathy, to avoid any wasteful operations, and to get the most from modern processors. Therefore, unless the fastest commercial products use magic or quantum computers, Spreads must be in the same bracket.

Series manipulation and join

Continuous and discrete series

Series can be continuous or discrete. Continuous series have values at any key, even between observed keys. For example, linear interpolation or cubic splines are continuous series defined from observed points. Another example is "last price", which is defined for any key as the observed price at or before that key.

Continuous series

Discrete series have values only at observations/events, e.g. trade volume is meaningful only at observed trades; there are no implied latent volumes between trades. We could create a derived continuous series, e.g. let liquidity = volume.SMA(N).Repeat(), but this series changes meaning from a real observed volume to an abstract analytical indicator of average liquidity over the last N observations.

Discrete Series

In the pictures, a solid line denotes a continuous series, a dotted line a discrete series, a solid blue dot an observation, and a white dot with a blue outline a calculated value of a continuous series at a key between observations.

Declarative lazy calculations

One of the core features of the Spreads library is declarative lazy series manipulation. A calculation on a series is not performed until results are pulled from it. For example, the expression let incremented = series + 1.0 is not evaluated until the incremented series is used. Instead, it returns a calculation definition that can be evaluated on demand.
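In pseudocode, the definition/evaluation split looks like this (illustrative, not the actual Spreads types):

    // nothing is computed here: `incremented` is only a cursor definition
    let incremented = series + 1.0
    // work happens only when values are pulled, one key at a time
    for KeyValue(k, v) in incremented do
        printfn "%A => %f" k v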

Missing values replacement

Missing values are really missing in Spreads, not represented as a special NA or option value. When missing values are present as special values, one needs to spend memory and CPU cycles to process them (and a lot of brain cycles to comprehend why missing values are somehow present, and not missing).

Two of the most frequently used series transformations are Repeat and Fill. Calling them on a discrete series returns a continuous series, where for each non-existing key we get the value from the key at or before the requested key for Repeat, or a given value for Fill:

let repeated = sparseSeries.Repeat()
let filled = sparseSeries.Fill(0.0)

The returned series contains an infinite number of values defined for any key, but the values at non-observed keys are calculated on demand and do not take any space.
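For example (pseudocode), with a sparse series the continuous views answer lookups at any key:

    let sparse = [1=>10.0; 3=>30.0]
    let repeated = sparse.Repeat()
    let filled = sparse.Fill(0.0)
    // repeated.[2] = 10.0  // value from the key at-or-before 2, i.e. key 1
    // repeated.[4] = 30.0  // the key at-or-before 4 is 3
    // filled.[2]   = 0.0   // the given default for any non-observed key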

ZipN

ZipN functionality is probably the most important part of Spreads.Core, and it is shown on the Spreads logo. ZipN supports declarative lazy joining of N series, in many cases replaces Frames/Panels functionality, and adds real-time incremental calculations over N joined series.

ZipN

All binary arithmetic operations are implemented via the ZipN cursor with N=2. ZipN always produces an inner join, but it is very easy to implement any complex outer join by transforming an input series from a discrete to a continuous one.

For example, imagine we have two discrete series (in pseudocode) let upper = [2=>2; 4=>4] and let lower = [1=>10; 3=>30; 5=>50] that correspond to the picture. If we add them via the + operator, we get an empty series, because there are no matching keys and an inner join returns an empty set. But if we repeat the upper series, we get two items, because the repeated upper series is defined at any key:

let sum = upper.Repeat() + lower // [3=>2+30=32; 5=>4+50=54]

If we then fill the lower series with 42, we will get:

let sum = upper.Repeat() + lower.Fill(42.0) // [2=>2+42=44; 3=>2+30=32; 4=>4+42=46; 5=>4+50=54]

For N series the logic remains the same. If we want to calculate a simple price index like the DJIA for each tick of the underlying stocks, we could take 30 tick series, repeat them (because ticks are irregular), apply ZipN, and calculate the average of the prices at any point:

let index30 : Series<DateTime,double> =
    arrayOfDiscreteSeries
    .Map(fun ds -> ds.Repeat())
    .ZipN(fun (k:DateTime) (vArr:double[]) -> vArr.Average())

The values array vArr is not copied, and the lambda must not return anything that holds a reference to the array. If the arrays of zipped values are needed for further use outside the zip method, one must copy the array inside the lambda. However, this is rarely needed, because we could zip outputs of zips and process the arrays inside a lambda without allocating memory. For example, if we have series of returns and weights from applying Zip as before, these series are not evaluated until values are requested, and when we zip them to calculate SumProduct, we will only allocate two arrays of values and one array of arrays (pseudocode):

let returns = arrayOfPrices
    .Map(fun p -> p.Repeat())
    .ZipN(fun k (vArr:double[]) -> vArr)
    .ZipLag(1,(fun (cur:double[]) (prev:double[]) -> cur.Zip(prev, (fun c p -> c/p - 1.0)))) // the last zip is on arrays, must be eager
let weights = arrayOfWeights
    .Map(fun p -> p.Repeat())
    .ZipN(fun k vArr -> vArr)
let indexReturn =
    returns.ZipN(weights.Repeat(), (fun k (ret:double[]) (ws:double[]) -> SumProduct(ret, ws)))

Here we violate the rule of not returning vArr, because it will be used inside the lambda of ZipLag, which applies its lambda to the current and lagged values and does not return references to them. But for this to be true, the Zip of arrays must be eager, and we have to allocate an array to store its result. We could change the example to avoid the intermediate allocations:

let returns = arrayOfPrices
    .Map(fun p -> p.Repeat())
    .ZipN(fun k (vArr:double[]) -> vArr)
    .ZipLag(1,(fun (cur:double[]) (prev:double[]) -> ValueTuple(cur,prev)))
let weights = arrayOfWeights
    .Map(fun p -> p.Repeat())
    .ZipN(fun k vArr -> vArr)
let indexReturn =
    returns.ZipN(
        weights.Repeat(),
        (fun k (ret:ValueTuple<double[],double[]>) (ws:double[]) ->
            let currentPrices : double[] = ret.Item1
            let previousPrices: double[] = ret.Item2
            let currentWeights: double[] = ws
            // imperative for loop to walk over three arrays
            // and calculate returns and sumproduct with weights
            // we need a single value and could get it in many
            // ways without copying the arrays
            0.0 // placeholder for the single computed value
        ))

In the last ZipN lambda we have three arrays: current prices, previous prices, and current weights. We could calculate the weighted return from them and return a single value. For each key, these arrays are refilled with new values and the last lambda is reapplied to the updated arrays.

When all series are continuous, we get a full outer join, and the resulting series has the union of all keys from the input series, with values defined by each continuous series' constructor. Other than repeat/fill, that could be linear or spline interpolation, a forecast from a moving regression, or any other complex logic hidden inside an input continuous series. To the outside world such a continuous series is defined at every point, the inner join assumes that every key exists, and zipping works as expected, just as if we had precalculated every point. But this works without allocating memory, and it also works in real time for streaming data.

Install

PM> Install-Package Spreads

Contributing

PRs & issues are welcome!

This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at http://mozilla.org/MPL/2.0/.

(c) Victor Baybekov, 2014-2017

Status and version

Current status is alpha and we are actively working on 1.0-beta release. We will use semantic versioning after 1.0 release.

spreads.native's People

Contributors

buybackoff


Forkers

stangelandcl

spreads.native's Issues

Vec<T>/Vec benchmark

Vec<T>/Vec benchmark

Benchmarking is hard... My casual benchmark initially showed very different results, but the T[] bounds check was definitely eliminated, and probably Span's as well. Then in b861533 Vec<T> was faster.

Spreads.Benchmark.Run() results

I usually do not use BenchmarkDotNet but run my casual benchmark helper, which measures time inside a using(){} block like this:

    using (Benchmark.Run("VecT", count * mult))
    {
        for (int m = 0; m < mult; m++)
        {
            var z = count - 1;
            for (int j = 1; j < z; j++)
            {
                sum += vecT.GetUnchecked(j - 1);
            }
        }
    }

This bench measures the running time of code regions inside a mega-method (e.g. a unit test with many loops and benchmarks). All code is in the same scope for the JIT and is run circularly for several cases, so order impact is probably eliminated and JIT optimizations are the same. Often code from one benchmark affects code from another, as was the case when I uncommented the Vec.Get<T> case. However, with only the three cases T[]/Span<T>/Vec<T> the numbers are stable and stay the same when running each case alone.

In most cases such a casual run already gives the big picture, and a single-digit performance difference is not important by itself for my needs. BenchmarkDotNet requires quite a lot of ceremony compared to this simple helper, and even with ShortRun it runs longer than dotTrace or dotMemory. That time is better spent analyzing the profilers' output. Such a simple benchmark is especially useful with dotTrace unit test integration - I can just comment out what I'm not interested in and quickly profile a unit test. Then, when I see that the unit test runner distorts the result, or I need memory profiling beyond GC counts, I just run a particular unit test in a standalone console app (as in this example) and use R#->Profile->Run startup project performance or memory profiling.

The results are repeatable by running Spreads.Native.Tests.VecTests.ForEachBench from a console app without a debugger.

ForEachBench

    Case  | MOPS    | Elapsed | GC0 | GC1 | GC2 | Memory
    VecT  | 1647.45 | 607 ms  | 0.0 | 0.0 | 0.0 | 0.000 MB
    Span  | 1567.40 | 638 ms  | 0.0 | 0.0 | 0.0 | 0.000 MB
    Array | 1567.40 | 638 ms  | 0.0 | 0.0 | 0.0 | 0.000 MB

BenchmarkDotNet results

After I claimed on Twitter that Vec<T> is faster I felt uncomfortable, so I tried to run the gold-standard BenchmarkDotNet, but the numbers were clearly nonsensical at the beginning. It looks like bounds-check elimination works quite well, and at 2-3 AM it was not straightforward to break it quickly without killing overall performance. It appears that adding an if that depends on the loop variable, and then accessing the array indexer with an index depending on that variable, is enough - even when it is as simple as:

                for (int j = 1; j < _count; j++)
                {
                    if (j >= 42)
                    {
                        sum += arr[j - 1];
                    }
                    else
                    {
                        sum += arr[j];
                    }
                }

Also, _count is volatile. I thought that volatile alone should be enough, and maybe it is, but with the if block the results look more like apples to apples.

The BenchmarkDotNet output:

image

Note how dismal the performance of Memory<T>.Span[i] is when we use it as a field of a class but need random access.

Bottom line

Logically, Vec<T> does exactly what Span<T> does, and the measured method explicitly excludes bounds checks - they all must be very close by construction. Fast Span is slightly faster in BDN on platforms that support Span natively, and slightly slower in my casual benchmark for some reason. Vec<T> is on par with T[] in BDN (excluding .NET Core 2.1, where T[] is much slower, but that looks suspicious) and slightly faster in my bench.

Overall they are all the same; the difference is not important, because the absolute performance is so high that even a single virtual call added to the code path will completely dominate (i.e. kill) it.

There are a few things that make Vec<T> better for our use case:

  • Could work with native memory - same as Span<T>.
  • Fast slicing without allocations - same as Span<T>.
  • Could be a field of a class - same as T[]. This covers 100% of use cases in Spreads so far: the underlying T[] is always GC-rooted, or pinned native memory is used.
  • Vec<T>.Span property exposes it as Span<T> so any API accepting span is available to use.
  • The non-generic Vec counterpart of Vec<T> has exactly the same binary layout and is ideal for column storage in DataFrame-like structures (which could work on native memory, e.g. we could use R-owned memory). Vec.Get<T> is as fast as Vec<T>.Get, and converting between the two is fast and trivial.
  • The non-generic Vec setter accepts objects but stores info about the underlying type T and uses dynamic, so passing an int to a Vec of longs just works. The performance impact of dynamic is quite tolerable and the flexibility benefits are much greater. The Vec getter/setter redirect to the Vec<T> getter/setter via ldftn + calli, and this is the dominating cost vs the all-inlined Vec<T> path.
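A hedged sketch of how these points combine in use (C#; names such as the constructor and AsVec conversion are illustrative, not the exact API surface):

    // Vec<T> can live in a class field, unlike Span<T>
    private Vec<long> _vec = new Vec<long>(new long[] { 1, 2, 3 });

    public void Demo()
    {
        Vec untyped = _vec.AsVec();   // hypothetical conversion: same binary layout
        untyped.Set(0, (object)42);   // setter accepts object; int -> long works via dynamic
        long first = _vec.Get(0);     // typed getter stays on the all-inlined path
        Span<long> span = _vec.Span;  // any Span-accepting API is available
    }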

I really hope that coreclr adds a non-ref counterpart of Span<T> sooner rather than later (3.0 maybe?), because its absence (before reinventing Vec<T>/Vec) blocked an important use case for fast data analysis.

cc @adamsitnik @jkotas

P.S. I have an important question left for coreclr gurus: ELI5, what was wrong with slice.net's Slice<T> being a non-ref struct, other than potential GC problems? Is it safe to use Vec<T>/Slice<T> if the underlying array is always GC-rooted? The IL code is here.

NuGet targets should check platform

For a false sense of small accomplishment, filter the NuGet targets by OS. It's completely not a big deal - or not a deal at all - just a small annoyance that native libs hang in the VS solution explorer.

We only need the .so on Windows for Docker debugging.

This was already done here: https://github.com/Spreads/Spreads.Native/blob/208406d06093b8c0a0be5a5df5a498de5a21cbec/dotnet/lib/native/Spreads.Native.targets

Also see the CefSharp issue about the same annoyance; we have just a few libs compared to them, but even they couldn't find a solution. cefsharp/CefSharp#2456

Wasted a lot of time on that; won't repack for this issue until future real code changes.

MacOS support

MacOS x64 is probably the only thing remaining TBD here for 1.0. Will do it when the entire stack is ready.

Bootstrapper review

The code is from the ancient Yeppp library C# wrapper. Very similar code was in the R loader, so maybe it originates from somewhere else.

The bootstrapper needs to be configurable on startup:

  • Static field with the location. All libs should first be loaded from there, or extracted to there otherwise.
  • Versioning could be external to resources: the static config method should accept an optional number; on the first load we save it to a file and later check it - if the number matches, we just open the libs at the location, otherwise we replace them. The file with the version should also be used as a file lock, and the app should exit if the versions are different but the file is locked.
  • Need to search GH once again for a better solution.

`Vec<T>` design

The original version was "unsafe" unsafe. But we do not need ref T from the indexer for the hottest path, just simple get/set. For that, unsafe is probably safe enough on the stack, and the returned T value is safe to use.

If this is not the case, then we just have the unchecked method and could use it as a field. The if (_pinnable == null) check is inevitable at some point without native JIT support if we need both native and managed memory. But saving on bounds checks compensates for this cost, and we should be at least on par with T[] when used e.g. from SortedMap. The benefits of the non-generic Vec remain.

The code is based on Span.Portable as of its last commit in corefx: https://github.com/dotnet/corefx/tree/64c6d9fe5409be14bdc3609d73ffb3fea1f35797

Add methods:

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public T GetUnchecked(int index)
        {
            if (typeof(T) == typeof(int))
            {
                return UnsafeEx.GetAtIndex<T>(_pinnable, _byteOffset, index);
            }

            // Other known types

            return GetRefUnchecked(index);
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public ref T GetRefUnchecked(int index)
        {
            if (_pinnable == null)
            {
                unsafe { return ref Unsafe.Add<T>(ref Unsafe.AsRef<T>(_byteOffset.ToPointer()), index); }
            }
            else
            {
                return ref Unsafe.Add<T>(ref Unsafe.AddByteOffset<T>(ref _pinnable.Data, _byteOffset), index);
            }
        }

GetRefUnchecked is the original Span.Portable code without bounds checks.

The IL of UnsafeEx.GetAtIndex<T> is:

	.method hidebysig static !!T GetAtIndex<T>(
		object obj, native int offset, int32 index) cil managed aggressiveinlining
	{
		.maxstack 3
		.locals([0] uint8& addr)
		ldarg.0     // load the object
		stloc.0     // convert the object pointer to a byref
		ldloc.0     // load the object pointer as a byref
		ldarg.1     // load the offset
		add         // add the offset
		ldarg.2     // load the index
		sizeof !!T  // load size of T
		conv.i
		mul         // multiply the index by size of T
		add         // add the result
		ldobj !!T   // load a T value from the computed address
		ret
	}

To review:

  • Maybe this typeof(T) == typeof(int) check is not a big deal, but it's free for the most important types, e.g. double.
  • Doing binary search by first converting to Span may be faster, or we could just use the code from there directly.
  • For blittables the fast path should be safe - I've done this many times, and it was even confirmed by JK - but is the non-ref return from a simple getter still unsafe for types with references? I don't understand the details: is the dereferencing itself unsafe, e.g. because the GC could move the object, or is using the object after the method returns what is unsafe? The first version of Vec<T> already used a simple non-ref return. Does the ldobj !!T operation fail to notify the GC that we are using the object, so that the reference is invalid and won't be updated when the object moves?

WIP & TBD

Move Vec<T> to Spreads.Core

All helpers will stay here.

We need to account for all known types. Spreads.DataTypes will have them all (including those defined in the top Spreads project; we will move them to Spreads.DataTypes.Financial).

But only those types that are truly general. E.g. trading-related structs must support Ethereum's 18-digit precision and the World GDP in Iranian rials. Such universal types should also get their own TypeEnum.

Alternatively, TimeStamp, SmallDecimal and others could be moved here, but the namespace would be wrong, and having the root Spreads namespace in this project breaks the docs. Moving them to core is better.

Add UnsafeEx.ReadVolatile<T>

Just add the volatile. prefix the same way as unaligned.

An unaligned volatile does not make sense and is an error.

This is a convenience to read 4/8-byte structs that we currently need to cast after volatile-reading 32/64-bit ints/uints.
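A sketch of what the IL could look like, mirroring GetAtIndex<T> above (the signature is an assumption; the volatile. prefix is the only addition):

	.method hidebysig static !!T ReadVolatile<T>(
		native int ptr) cil managed aggressiveinlining
	{
		.maxstack 1
		ldarg.0     // load the address
		volatile.   // volatile prefix, applied the same way as unaligned.
		ldobj !!T   // load a T value with volatile semantics
		ret
	}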
