fslaborg / deedle

Easy to use .NET library for data and time series manipulation and for scientific programming

Home Page: http://fslab.org/Deedle/

License: BSD 2-Clause "Simplified" License

Shell 0.01% F# 94.55% C# 0.92% HTML 2.17% Batchfile 0.01% PowerShell 0.03% Jupyter Notebook 1.66% Forth 0.66%

deedle's Issues

Consider better story for building frames

Do we need some sort of computation builder for creating frames & series?

C# documentation

Provide some examples of using data frame from C#

Improve exceptions

It would be helpful if duplicate key exceptions reported the duplicate key; otherwise it's painful to figure out what happened.

System.ArgumentException: Duplicate keys are not allowed in the index.
Parameter name: keys
at Microsoft.FSharp.Core.Operators.Raise[T](Exception exn)
at FSharp.DataFrame.Indices.Linear.LinearIndex`1..ctor(IEnumerable`1 keys, IIndexBuilder builder, FSharpOption`1 ordered) in C:\dev\FSharp.DataFrame\src\Indices\LinearIndex.fs:line 50

Consider adding query builder

We could support something like this:

frame { for r in frame do
        indexRowsString "Name"
        shift 1
        window 5 into win 
        select ... }

Query support

Provide query builder for series/frame.

Not entirely clear what could be supported, but we could certainly add something!

Fix and document functions

Like this one:

let inline filterCols f (frame:Frame<'TColumnKey, 'C>) = 
  frame.Columns |> Series.filter f |> FrameUtils.fromColumns

Interpolation functions

Add functions for interpolating missing values in a series (or better, build a function that can calculate values for keys not in series - and also can be used in FillMissing)
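
As a rough illustration of the second idea (calculating a value for a key that is not in the series), here is a minimal sketch that assumes linear interpolation over observations sorted by key. It uses plain arrays rather than Deedle types, and the name `interpolateAt` is hypothetical, not an existing library function.

```fsharp
/// Linearly interpolates a value for `key` from observations sorted by key.
/// Returns None when the key lies outside the observed range.
/// Hypothetical helper for illustration; not a Deedle function.
let interpolateAt (observations: (float * float)[]) (key: float) : float option =
    // Exact match: return the stored value
    match observations |> Array.tryFind (fun (k, _) -> k = key) with
    | Some (_, v) -> Some v
    | None ->
        // Find the neighbouring pair of observations that brackets the key
        observations
        |> Array.pairwise
        |> Array.tryPick (fun ((k0, v0), (k1, v1)) ->
            if k0 < key && key < k1 then
                Some (v0 + (v1 - v0) * (key - k0) / (k1 - k0))
            else
                None)

let obs = [| (1.0, 10.0); (2.0, 20.0); (4.0, 40.0) |]
let atThree = interpolateAt obs 3.0   // Some 30.0
let outside = interpolateAt obs 5.0   // None
```

A function of this shape could presumably also back a FillMissing-style operation, by calling it for each key whose value is missing.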

Support C# dynamic

On series & frame for getting/adding values and series. Also on Series builder, tests currently fail.

Plugin for R provider

View source button in docs

... to go to the fsx file in generated documentation.

General feedback

Some time ago I did a small comparison of some R code (https://gist.github.com/ovatsus/5354187#file-original-r), with the equivalent code using CsvProvider (https://gist.github.com/ovatsus/5354187#file-csvprovider-fsx) and using the untyped CsvFile (https://gist.github.com/ovatsus/5354187#file-csvfile-fsx)

I did the same for DataFrame (https://gist.github.com/ovatsus/5354187#file-dataframe-fsx), and I have some feedback:

Line 45: I needed to filter the data to where Ozone > 31 and Temp > 90, but some rows didn't have the Ozone value. I was forced to fill the missing values with zeros to work around it. Is there a better way to do this? Maybe the dynamic operator on a series could return a Double.NaN when the value is missing?
Line 52: I was expecting frame |> Frame.getCol "Solar.R" to be the same as frame.["Solar.R"], but the first one gives this error: Type constraint mismatch when applying the default type 'int' for a type inference variable. The type 'int' does not support the operator 'DivideByInt' Consider adding further type constraints
Line 61: Is there any better way to do this than iris |> Frame.filterRowValues (fun x -> x.GetAs<string>("Species") = "virginica")? It feels very long for such a simple operation

Simplify setting index column.

Using frame.WithRowIndex<T>(...) is ugly.

Support Frame.indexWithDate "foo" and Frame.indexWithInt "foo" (and a few standard things) - and similarly for member methods (maybe)?

Series and Frames for real-time streaming data

What would be the right way to use Series in a real-time environment where new data arrive asynchronously?

I have found a question (and probably a part of an answer) that describes exactly the idea. http://stackoverflow.com/questions/17941932/f-immutable-data-structures-for-high-frequency-real-time-streaming-data

The answers on SO suggest using the FSharpx.Collections.Vector<T> data structure instead of arrays. Another answer (http://stackoverflow.com/a/19520214/801189) on SO by @tpetricek explains why arrays are faster than lists for fixed data, and I believe that was one of the reasons for the initial implementation of Vector as ArrayVector in Deedle. I think the current focus of Deedle is on fixed, existing series and frames, a workflow much like R's. But if the data length is fixed, performance is less important than in a real-time environment.

For streaming data we need to append new value(s) to an existing series and use the new series. With the current array implementation that requires copying the whole old array to a new, resized array. In the first question the author mentions 5 million data points per instrument per day (let's assume an 8-byte double plus DateTime's 8 bytes), or around 80 MB per instrument. With e.g. 100 instruments, copying all arrays many times per second is probably not the best option.

Simplest use case
For stock price A with 1 second interval we calculate 60-second moving average and store it in a series MA_A_60. We update all vectors as new data points arrive.

For a new price point we create a new series object by appending the point to the old object (in the case of a very large data set, copying the array is slow)
Then we take last 60 values from the new series object and calculate new MA value (crucial point is to avoid recalculation for all MA values, but take only the last 60-point window from A)
We append new MA value to the MA_A_60.
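
The incremental part of this use case can be sketched without Deedle at all. The sketch below swaps in a slightly different technique from the steps above: instead of re-reading the last 60 values on each update, it keeps a running sum over a fixed-size queue, so each new point costs O(1). The `RollingMean` type is hypothetical and only for illustration.

```fsharp
open System.Collections.Generic

/// Keeps a running sum over the last `size` values so that each new
/// data point updates the mean in O(1), without rescanning the series.
/// Hypothetical type for illustration; not part of Deedle.
type RollingMean(size: int) =
    let window = Queue<float>()
    let mutable sum = 0.0

    /// Adds a value and returns the mean of the current window,
    /// or None until the window holds `size` values.
    member this.Add(value: float) : float option =
        window.Enqueue value
        sum <- sum + value
        // Drop the oldest value once the window is over capacity
        if window.Count > size then
            sum <- sum - window.Dequeue()
        if window.Count = size then Some (sum / float size) else None

// A window of 3 keeps the example short; the use case above would use 60.
let ma = RollingMean 3
let results = [ 1.0; 2.0; 3.0; 4.0 ] |> List.map (fun v -> ma.Add v)
// results = [None; None; Some 2.0; Some 3.0]
```

A running sum accumulates floating-point error over very long streams, so a real implementation might periodically recompute the sum from the window.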

Will the current implementation be suitable for such workflow for hundreds of instruments, multiple calculated values for each one and sub-second frequency?

Will an implementation of Deedle's IVector with FSharpx.Collections.Vector be more suitable for such use case? (I know one should run some tests in a similar situation, but there is no second implementation to compare with)

I would love to have Deedle's abstraction and API for such use case!

P.S. An abstraction of the workflow: if seriesB = f seriesA, then we could somehow link series B to series A, watch for new values in A and add the new values to B (applying the function f only to the incremental data). For this we would need some projection object that keeps seriesB always synchronized with seriesA using the transformation function f. In turn, there could be some seriesC = f2 seriesB and so on. I am not sure that this functionality should be inside the library, but that is what I hope to achieve.

Customizable question mark

Automatically casting to Series<'K, float> when writing frame?ColName is not ideal for all applications. Perhaps make it possible to change the behaviour by opening a different namespace?

Add Series.fold, Series.reduce, Series.scan etc.

Handle CSV files with missing column keys

For example, given the following CSV file:

a,,
1,2,3 
1,2,3
1,2,3

The Frame.ReadCsv function fails. It should instead generate some names for the unlabeled columns.
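
One possible shape for the name generation, as a minimal sketch on the header line alone. The `Column<N>` naming scheme and the function name `fillColumnNames` are assumptions made for illustration; the naive split on commas also ignores quoted fields.

```fsharp
/// Replaces empty header cells with generated names.
/// The "Column<N>" scheme (1-based position) is an assumption for illustration.
let fillColumnNames (headerLine: string) : string[] =
    headerLine.Split(',')
    |> Array.mapi (fun i name ->
        if System.String.IsNullOrWhiteSpace name then
            sprintf "Column%d" (i + 1)
        else
            name.Trim())

let names = fillColumnNames "a,,"
// names = [| "a"; "Column2"; "Column3" |]
```

A real implementation would also need to avoid clashing with a labelled column that already happens to be called, say, `Column2`.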

Series.Join with JoinKind.Outer ignores lookup option

Currently the lookup option only works for left or right joins on series, but doesn't work with an outer join. The more expected behavior would be to look up values on both sides respectively. Could we do this?

An example use case could be ticks for two stocks that arrive at different times, while the ratio of the two should be updated with every new piece of information.

Another example is the exchange schedules in Israel/Gulf countries (closed on Fri, Sat) and Western exchanges (closed on Sat, Sun). On Fri one would want to get Thu data from Israel/GCC, but on Sun the Fri data from the West.

In both cases one would need outer join with Lookup.NearestSmaller.
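
To make the requested behavior concrete, here is a minimal sketch of an outer join that looks up the nearest smaller-or-equal key on both sides. It works on plain key-sorted arrays rather than Deedle series, and the function names are hypothetical. The linear scan per key is for clarity only.

```fsharp
/// Value at the greatest key <= `key`, or None if no such key exists.
/// Assumes `observations` is sorted by key in ascending order.
let nearestSmaller (observations: (int * float)[]) (key: int) : float option =
    observations
    |> Array.tryFindBack (fun (k, _) -> k <= key)
    |> Option.map snd

/// Outer join over the union of keys, looking up the nearest
/// smaller-or-equal key on both sides.
let outerJoinNearestSmaller (left: (int * float)[]) (right: (int * float)[]) =
    Array.append (Array.map fst left) (Array.map fst right)
    |> Array.distinct
    |> Array.sort
    |> Array.map (fun k -> k, (nearestSmaller left k, nearestSmaller right k))

let a = [| (1, 10.0); (4, 11.0) |]
let b = [| (2, 20.0); (3, 21.0) |]
let joined = outerJoinNearestSmaller a b
// joined =
//   [| (1, (Some 10.0, None));      (2, (Some 10.0, Some 20.0));
//      (3, (Some 10.0, Some 21.0)); (4, (Some 11.0, Some 21.0)) |]
```

With this behavior, the ratio of two tick series would be defined at every key from the first point where both sides have seen at least one value.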

Deedle DataFrame with sliced columns to R conversion exception.

I get an exception when converting a Deedle data frame with sliced columns to R.
This is the script:

#I "..\\packages\\Deedle.0.9.11-beta"
#I "..\\packages\\RProvider.1.0.4"
#load "RProvider.fsx"
#load "Deedle.fsx"

open Deedle
open RDotNet
open RProvider
open RProvider.``base``
open RProvider.datasets

let mtcars : Frame<string, string> = R.mtcars.GetValue()
let mtcars' = mtcars.Columns.[["vs";"am";"gear";"carb"]]
R.as_data_frame(mtcars)  // works
R.as_data_frame(mtcars') // fails

Error message:

System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.Exception: No converter registered for type System.Object[] or any of its base types
   at [email protected](String message) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 102
   at RProvider.RInteropInternal.REngine.SetValue(REngine this, Object value, FSharpOption`1 symbolName) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 212
   at RProvider.RInteropInternal.toR(Object value) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 225
   at RProvider.RInterop.passArg@312(List`1 tempSymbols, Object arg) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 326
   at [email protected](IEnumerable`1& next) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 334
   at Microsoft.FSharp.Core.CompilerServices.GeneratedSequenceBase`1.MoveNextImpl()
   at Microsoft.FSharp.Core.CompilerServices.GeneratedSequenceBase`1.System-Collections-IEnumerator-MoveNext()
   at Microsoft.FSharp.Collections.SeqModule.ToArray[T](IEnumerable`1 source)
   at RProvider.RInterop.callFunc(String packageName, String funcName, IEnumerable`1 argsByName, Object[] varArgs) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 331
   at <StartupCode$Deedle-RProvider-Plugin>.$Exports.RProvider-IConvertToR-1-Convert@21.Deedle-IFrameOperation`1-Invoke[a,b](Frame`2 )
   --- End of inner exception stack trace ---
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor)
   at System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(Object obj, Object[] parameters, Object[] arguments)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at [email protected](a engine, b value) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 98
   at RProvider.RInteropInternal.REngine.SetValue(REngine this, Object value, FSharpOption`1 symbolName) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 212
   at RProvider.RInteropInternal.toR(Object value) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 225
   at RProvider.RInterop.passArg@312(List`1 tempSymbols, Object arg) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 326
   at [email protected](IEnumerable`1& next) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 334
   at Microsoft.FSharp.Core.CompilerServices.GeneratedSequenceBase`1.MoveNextImpl()
   at Microsoft.FSharp.Core.CompilerServices.GeneratedSequenceBase`1.System-Collections-IEnumerator-MoveNext()
   at Microsoft.FSharp.Collections.SeqModule.ToArray[T](IEnumerable`1 source)
   at RProvider.RInterop.callFunc(String packageName, String funcName, IEnumerable`1 argsByName, Object[] varArgs) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 331
   at RProvider.RInterop.call(String packageName, String funcName, String serializedRVal, Object[] namedArgs, Object[] varArgs) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 375
   at <StartupCode$FSI_0008>.$FSI_0008.main@()

Float to string conversion when framing a series

Hi,
I am new to F#. I ran into a strange issue while experimenting with Deedle, which I posted here. I'd appreciate it if you could help me out with this.

http://stackoverflow.com/questions/19795949/map-to-deedle-frame/19796225#19796225

Thanks.

Index that avoids duplicates

Add IIndex type that automatically avoids duplicate-key errors (e.g. when appending data frames that have an ordinal index, the index should be re-calculated)

Add Frame.empty and Series.empty

Naming

Don suggests renaming Series.ofObservations to something else (like Series.ofPairs). I think "observations" is a bit too long, so I agree, but I'm not entirely sure what the best name would be. "pair" sounds okay, but maybe not ideal.

Output CSV/TSV & Serialization in general

Support serializing data frames & series and writing them to CSV/TSV.

Rows with missing values

Add function to get rows with some missing values (useful for diagnostics...)

Type provider

Add type provider that can be used for creating statically typed data frames.

Synchronize extensions and module functions

They should cover the same functionality.

Join DF and Series

Add overload taking series...

Hierarchical indexing

Consider & add support for hierarchical indexing...

Expose types of columns

Internally, columns are stored as values of type IVector<V> and the type of V matters sometimes (e.g. when passing data to R provider).

The Print operation should say what the types are and we need some functions to convert those if they are incorrect.

(But also, slicing should preserve these types...)

Frame.ReadCsv throws an exception with web streams

I tried to retrieve data from the web, but I got an error.

open System.Net
open Deedle

let irisDataUri = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
let iris =
   let request = WebRequest.Create (irisDataUri)
   use response = request.GetResponse ()
   use stream = response.GetResponseStream ()
   Frame.ReadCsv (stream, false)

Unhandled Exception: System.NotSupportedException: This stream does not support seek operations.
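
Until the reader handles non-seekable streams itself, one workaround is to buffer the response into a MemoryStream first. This is a minimal sketch using only System.IO; the helper name `toSeekable` is hypothetical, and the in-memory source below merely stands in for the web response stream.

```fsharp
open System.IO
open System.Text

/// Copies any readable stream into a MemoryStream, which supports seeking.
/// The whole payload is held in memory, so this suits small files only.
let toSeekable (source: Stream) : MemoryStream =
    let buffer = new MemoryStream()
    source.CopyTo buffer
    buffer.Position <- 0L   // rewind so the reader starts at the beginning
    buffer

// Stand-in for response.GetResponseStream() from the script above
let source = new MemoryStream(Encoding.UTF8.GetBytes "5.1,3.5,1.4,0.2,Iris-setosa")
let seekable = toSeekable source
let canSeek = seekable.CanSeek
let firstLine = (new StreamReader(seekable)).ReadLine()
```

In the failing script, the last line would then presumably become `Frame.ReadCsv (toSeekable stream, false)`.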

Flattening via reflection

flatten nested data structures

Add Frame.foo version of df.RenameSeries

The operation is currently only available as a member (mutating the frame). We should add non-mutating module operation.

.NET 4.0 support

This is a great and long awaited library, but could it target .NET 4.0?

It is quite trivial to replace IReadOnlyList by ReadOnlyCollection (done here https://github.com/buybackoff/FSharp.DataFrame/commit/7e0b84c096ab3a27a55fc0658c832555cd65f269, all tests pass).

However there are modules FrameUtils and FrameExtentions that are tightly coupled with FSharp.Data.DesignTime for type inference from TextReader. Then the method ReadCSV is used from tests, but the data supplied is a .csv file. As I understand, runtime FSharp.Data could infer types from sample files, but in FrameUtils the data is supplied as TextReader.

This SO question says one doesn't need the DesignTime reference and could delete it, but not in this case. http://stackoverflow.com/questions/19214044/is-fsharp-data-designtime-net-4-5-only

Probably .CSV parsing utility and extensions should not be a part of the DataFrame itself, but reside in tests or samples? I am quite happy with Frame constructor only and could easily construct columns myself and use the constructor like on the last line in FrameUtils: Frame(rowIndex, columnIndex, Vector.ofValues columns).

IFsiFormattable

Have you considered instead of having the IFsiFormattable interface and then registering an fsi printer for it, just turning the Format method into a property and use [<StructuredFormatDisplay("{Format}")>]?

That way we wouldn't need to have #load "FSharp.DataFrame.fsx", just #r "FSharp.DataFrame.dll".

This would be especially helpful with the new "Send to FSI" command in VS2013, making it a better experience: just reference the dll and be ready to go.

The problem with this is that the Format member would appear in IntelliSense, but it could be hidden from C# by using [<EditorBrowsable(EditorBrowsableState.Never)>] and from F# by using [<CompilerMessage("This method is intended to be used only by FSI Printer", 10002, IsHidden=true, IsError=false)>]

(Another simpler option would be just to use the ToString, which would be even nicer for C# users)

Diff data frames/series

Given two series, we want to know how they differ. That is, find which keys are available in one, but not in the other and find the keys for which they both have values, but the values differ.

This would be very useful for interactive exploration - when you get two data frames or two series and want to quickly check how they differ (e.g. when they represent two versions of the same data set).

For example, say we have the following two series:

let s1 = series [ 1 => 1.0; 2 => 2.0; 3 => 3.0 ]
let s2 = series [ 1 => 10.0; 2 => 2.0; 4 => 4.0 ]

The difference could be described using a simple discriminated union, something like this:

type Diff<'T> = 
  | Change of 'T * 'T 
  | Remove of 'T 
  | Add of 'T
  override x.ToString() =
    match x with 
    | Change(a, b) -> sprintf "%A -> %A" a b 
    | Remove v -> sprintf "-%A" v 
    | Add v -> sprintf "+%A" v

And Series.compare a b would return something like this:

series [1 => Change(1.0, 10.0); 3 => Remove 3.0; 4 => Add 4.0 ]

Comparing frames could work in a similar way...
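
A minimal sketch of how such a comparison could be computed, using F# `Map` values as a stand-in for series so that it runs without Deedle. The function name `compareMaps` is hypothetical, and the `Diff` type is repeated here (without the `ToString` override) only to keep the snippet self-contained.

```fsharp
type Diff<'T> =
    | Change of 'T * 'T
    | Remove of 'T
    | Add of 'T

/// Compares two maps key by key; keys whose values are equal are omitted.
let compareMaps (a: Map<'K, 'T>) (b: Map<'K, 'T>) : Map<'K, Diff<'T>> =
    // Union of the keys from both sides
    let keys =
        Seq.append (Map.toSeq a |> Seq.map fst) (Map.toSeq b |> Seq.map fst)
        |> Seq.distinct
    keys
    |> Seq.choose (fun k ->
        match Map.tryFind k a, Map.tryFind k b with
        | Some x, Some y when x <> y -> Some (k, Change (x, y))
        | Some x, None -> Some (k, Remove x)
        | None, Some y -> Some (k, Add y)
        | _ -> None)
    |> Map.ofSeq

let m1 = Map.ofList [ (1, 1.0); (2, 2.0); (3, 3.0) ]
let m2 = Map.ofList [ (1, 10.0); (2, 2.0); (4, 4.0) ]
let diff = compareMaps m1 m2
// diff = map [(1, Change (1.0, 10.0)); (3, Remove 3.0); (4, Add 4.0)]
```

Note that with float values, NaN never equals itself, so a key holding NaN on both sides would be reported as a Change under this definition.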

Support loading a CSV directly from a URL into a Frame

It would be nice to be able to do:

let frame = Frame.ReadCsv "http://faculty.washington.edu/heagerty/Books/Biostatistics/DATA/ozone.csv"

similarly to what can be done with the CsvProvider

Performance of CSV reader

Make sure the CSV reader is fast... (pandas default CSV reader can handle some 10k rows, but more is slow?)

Improve behaviour of `nestBy`

In the current version, the function takes just a projection:

df |> Frame.nestBy fst

Given Frame<R1 * R2, C>, this produces Series<R1, Frame<R1 * R2, C>> but it would be more reasonable to produce Series<R1, Frame<R2, C>>. To do that, we would have to take a pair of functions rather than fst.

Also, rename this to nestRowsBy and add nestColsBy.
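
The pair-of-functions idea can be sketched on plain `Map` values (standing in for frames and series). The signature below is one hypothetical shape for nestRowsBy, not the library's: the first function picks the outer key and the second the key kept inside each nested group.

```fsharp
/// Groups rows keyed by 'K into an outer map keyed by `outerKey k`,
/// with inner maps keyed by `innerKey k` only.
let nestRowsBy (outerKey: 'K -> 'K1) (innerKey: 'K -> 'K2) (rows: Map<'K, 'V>) : Map<'K1, Map<'K2, 'V>> =
    rows
    |> Map.toList
    |> List.groupBy (fun (k, _) -> outerKey k)
    |> List.map (fun (outer, group) ->
        outer, group |> List.map (fun (k, v) -> innerKey k, v) |> Map.ofList)
    |> Map.ofList

let rows = Map.ofList [ (("A", 1), 1.0); (("A", 2), 2.0); (("B", 1), 3.0) ]
let nested = nestRowsBy fst snd rows
// nested.["A"] = map [(1, 1.0); (2, 2.0)]
// nested.["B"] = map [(1, 3.0)]
```

Passing `fst` and `snd` gives the Series<R1, Frame<R2, C>>-like shape described above; other pairs of functions would allow nesting by a derived key.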

Simplify lazy loading

Should not need midpoint argument.

Finish walkthroughs & sample scripts

Throw when using Lookup with Join on unordered series

(because that does not make sense)

Suggestion: when using Frame.ofRecords and there's a single record field of type Date, automatically use that as the key

Frame.sum should not fail

Frame.sum on a frame that contains columns with non-numeric data should not throw. It should return a series with missing values or drop the columns.

Missing FSharp.Data dependency on .nuspec

Frame.ReadCsv depends on it

Add packages.config for samples

Should get F# Charting....

Stack & unstack

Review and do the standard R thing

R plugin - data frame columns

This commit (ef65df7) tried to fix an error where passing Deedle data frame to R would fail.

However, the problem isn't the size of the data frame, but instead, the column keys - the operation R.data_frame fails when the column names are not valid R identifiers. The $<- operation can handle that, because it takes the name as a string (not as a named param).

We should probably build data frames using $<- unless that is slower.

Documentation small issue

The font used in the docs for the tooltips has a problem: 0 and o are very hard to distinguish (tested both in Chrome and IE on Windows).
For example, in http://bluemountaincapital.github.io/FSharp.DataFrame/tutorial.html, in the second code block, in the first (...), where the tooltip is seq { for i in 0 .. (count - 1) ->
I was honestly confused and looking for the definition of the o variable :)
I suggest the same font from the code is used instead.

Improve testing

Add more tests for series, etc. based on F# interactive scripts.

Generated docs from XML comments

The comments are written in Markdown, so this needs to be transformed first. Then we need to generate nice doc page from it...