
deedle's People

Contributors

adamklein, alexpantyukhin, andrewiom, arlofin, buybackoff, cdrnet, choc13, dcharbon, dsyme, eiriktsarpalis, elanhasson, foggyfinder, forki, isaacabraham, jozefizso, kblohm, kmutagene, marklam, markpflug, patrickmcdonald, pkese, reedcopsey, rmunn, shalokshalom, shofergvc, tom162, tpetricek, whiteblackgoose, zieg, zyzhu

deedle's Issues

Improve exceptions

It would be helpful if duplicate key exceptions reported the duplicate key; otherwise it's painful to figure out what happened.

System.ArgumentException: Duplicate keys are not allowed in the index.
Parameter name: keys
at Microsoft.FSharp.Core.Operators.Raise[T](Exception exn)
at FSharp.DataFrame.Indices.Linear.LinearIndex`1..ctor(IEnumerable`1 keys, IIndexBuilder builder, FSharpOption`1 ordered) in C:\dev\FSharp.DataFrame\src\Indices\LinearIndex.fs:line 50
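
One way the request could look, as a minimal sketch rather than the library's actual code: the helper `ensureUniqueKeys` is a hypothetical name, and the only point is that the offending key ends up in the exception message.

```fsharp
open System
open System.Collections.Generic

// Hypothetical helper (not Deedle's code): validates index keys and
// names the first duplicate in the exception message.
let ensureUniqueKeys (keys: seq<'K>) : 'K[] =
    let seen = HashSet<'K>()
    let keys = Array.ofSeq keys
    for k in keys do
        // HashSet.Add returns false when the key is already present
        if not (seen.Add k) then
            let msg = sprintf "Duplicate key %A is not allowed in the index." k
            raise (ArgumentException(msg, "keys"))
    keys
```

With this shape, a caller who passes `[ 1; 2; 2; 3 ]` would see the key `2` in the error text instead of having to search the data for it.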

Consider adding query builder

We could support something like this:

frame { for r in frame do
        indexRowsString "Name"
        shift 1
        window 5 into win 
        select ... } 

Query support

Provide query builder for series/frame.

Not entirely clear what could be supported, but we could certainly add something!

Fix and document functions

Like this one:

let inline filterCols f (frame:Frame<'TColumnKey, 'C>) = 
  frame.Columns |> Series.filter f |> FrameUtils.fromColumns

Interpolation functions

Add functions for interpolating missing values in a series (or better, build a function that can calculate values for keys not in series - and also can be used in FillMissing)
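
A rough sketch of what "calculate values for keys not in series" could mean, assuming linear interpolation over numeric keys. A sorted array of (key, value) pairs stands in for a series here, and `interpolateAt` is a hypothetical name, not an existing Deedle function.

```fsharp
// Sketch only: `points` must be sorted by key. Returns None when the
// requested key lies outside the range covered by the observations.
let interpolateAt (points: (float * float)[]) (key: float) : float option =
    // Nearest observation at or before the key, and at or after it
    let before = points |> Array.tryFindBack (fun (k, _) -> k <= key)
    let after = points |> Array.tryFind (fun (k, _) -> k >= key)
    match before, after with
    | Some (k0, v0), Some (k1, _) when k0 = k1 -> Some v0   // exact hit
    | Some (k0, v0), Some (k1, v1) ->
        // Linear blend between the two neighbouring observations
        Some (v0 + (v1 - v0) * (key - k0) / (k1 - k0))
    | _ -> None
```

A function of this shape could be passed to a fill-missing operation, since it answers "what is the value at key k" for any k inside the observed range.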

Support C# dynamic

On series & frame for getting/adding values and series. Also on Series builder, tests currently fail.

General feedback

Some time ago I did a small comparison of some R code (https://gist.github.com/ovatsus/5354187#file-original-r), with the equivalent code using CsvProvider (https://gist.github.com/ovatsus/5354187#file-csvprovider-fsx) and using the untyped CsvFile (https://gist.github.com/ovatsus/5354187#file-csvfile-fsx)

I did the same for DataFrame (https://gist.github.com/ovatsus/5354187#file-dataframe-fsx), and I have some feedback:

  • Line 45: I needed to filter the data to where Ozone > 31 and Temp > 90, but some rows didn't have the Ozone value. I was forced to fill the missing values with zeros to work around it. Is there a better way to do this? Maybe the dynamic operator on a series could return a Double.NaN when the value is missing?
  • Line 52: I was expecting frame |> Frame.getCol "Solar.R" to be the same as frame.["Solar.R"], but the first one gives this error: Type constraint mismatch when applying the default type 'int' for a type inference variable. The type 'int' does not support the operator 'DivideByInt' Consider adding further type constraints
  • Line 61: Is there any better way to do this than iris |> Frame.filterRowValues (fun x -> x.GetAs<string>("Species") = "virginica")? It feels very long for such a simple operation

Simplify setting index column.

Using frame.WithRowIndex<T>(...) is ugly.

Support Frame.indexWithDate "foo" and Frame.indexWithInt "foo" (and a few standard things) - and similarly for member methods (maybe)?

Series and Frames for real-time streaming data

What would be the right way to use Series in a real-time environment where new data arrive asynchronously?

I have found a question (and probably a part of an answer) that describes exactly the idea. http://stackoverflow.com/questions/17941932/f-immutable-data-structures-for-high-frequency-real-time-streaming-data

The answers on SO suggest using the FSharpx.Collections.Vector<T> data structure instead of arrays. Another answer (http://stackoverflow.com/a/19520214/801189) on SO by @tpetricek explains why arrays are faster than lists for fixed data, and I believe that was one of the reasons for the initial implementation of Vector as ArrayVector in Deedle. I think the current focus of Deedle is to deal with fixed, existing data series and frames, a workflow very similar to R. But if the data length is fixed, then performance is less important than in a real-time environment.

For streaming data we need to append new value(s) to an existing series and use the new series. With the current array implementation, that requires copying the whole old array into a new, resized array. In the first question the author mentions 5 million data points per instrument per day (let's assume an 8-byte double plus DateTime's 8 bytes), or around 80 MB per instrument. With e.g. 100 instruments, copying all arrays many times per second is probably not the best option.

Simplest use case
For stock price A with 1 second interval we calculate 60-second moving average and store it in a series MA_A_60. We update all vectors as new data points arrive.

  1. For a new price point we create a new series object by appending the old object (in the case of a very large data set copying array is slow)
  2. Then we take last 60 values from the new series object and calculate new MA value (crucial point is to avoid recalculation for all MA values, but take only the last 60-point window from A)
  3. We append new MA value to the MA_A_60.
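
The per-tick update in the steps above can be sketched without touching the full history. This is an illustration only: `RollingMean` is a hypothetical type, not part of Deedle, and it keeps a running sum instead of re-reading the last 60 values as step 2 describes.

```fsharp
open System.Collections.Generic

// Hypothetical helper: keeps only the last `window` observations and a
// running sum, so each new price costs O(1) work.
type RollingMean(window: int) =
    let values = Queue<float>()
    let mutable sum = 0.0
    /// Adds a new observation and returns the mean of the values
    /// currently in the window (fewer than `window` during warm-up).
    member this.Add(x: float) : float =
        values.Enqueue x
        sum <- sum + x
        if values.Count > window then
            sum <- sum - values.Dequeue()
        sum / float values.Count
```

For the use case above one would create `RollingMean(60)` per instrument and append each returned value to MA_A_60. The running sum can accumulate floating-point error over very long runs, so a real implementation might periodically recompute it from the queue.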

Will the current implementation be suitable for such workflow for hundreds of instruments, multiple calculated values for each one and sub-second frequency?

Will an implementation of Deedle's IVector with FSharpx.Collections.Vector be more suitable for such use case? (I know one should run some tests in a similar situation, but there is no second implementation to compare with)

I would love to have Deedle's abstraction and API for such use case!

P.S. An abstraction of the workflow: if seriesB = f seriesA, then we could somehow link series B to series A, watch for new values in A, and add the new values to B (applying the function f only to the incremental data). For this we would need some projection object that keeps seriesB always synchronized with seriesA using the transformation function f. In turn, there could be some seriesC = f2 seriesB, and so on. I am not sure that this functionality should be inside the library, but that is what I hope to achieve.

Customizable question mark

Automatically casting to Series<'K, float> when writing frame?ColName is not ideal for all applications. Perhaps make it possible to change the behaviour by opening a different namespace?

Handle CSV files with missing column keys

For example, given the following CSV file:

a,,
1,2,3 
1,2,3
1,2,3

The Frame.ReadCsv function fails. It should instead generate some names for the unlabeled columns.
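
A minimal sketch of the naming step, assuming a "Column N" scheme (the scheme and the helper name `fillColumnNames` are made up for illustration). It splits on commas naively and does not handle quoted fields.

```fsharp
// Hypothetical helper: replaces blank header cells with generated names
// based on the 1-based column position.
let fillColumnNames (headerLine: string) : string[] =
    headerLine.Split(',')
    |> Array.mapi (fun i name ->
        if System.String.IsNullOrWhiteSpace name then sprintf "Column %d" (i + 1)
        else name.Trim())
```

For the header line `a,,` from the example, this yields the names `a`, `Column 2` and `Column 3`.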

Series.Join with JoinKind.Outer ignores lookup option

Currently the lookup option only works for left or right joins on series, but it doesn't work with an outer join. The more expected behavior would be to look up values on both sides respectively. Could we do this?

An example use case could be ticks for two stocks that arrive at different times, while the ratio of the two should be updated with every new piece of information.

Another example is the exchange schedules in Israel/Gulf countries (closed on Fri, Sat) and Western exchanges (closed on Sat, Sun). On Fri one would want to get the Thu data from Israel/GCC, but on Sun the Fri data from the West.

In both cases one would need outer join with Lookup.NearestSmaller.
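
To make the requested behaviour concrete, here is a sketch on plain sorted arrays of (key, value) pairs rather than on Deedle series. Both function names are hypothetical, and "nearest smaller" is assumed to mean the value at the key itself or, failing that, at the closest smaller key.

```fsharp
// Value at `key`, or at the closest smaller key; None if there is none.
// `points` must be sorted by key.
let lookupNearestSmaller (points: ('K * 'V)[]) (key: 'K) : 'V option =
    points |> Array.tryFindBack (fun (k, _) -> k <= key) |> Option.map snd

// Outer join: every key from either side appears in the result, and each
// side is filled in using the nearest-smaller lookup.
let outerJoinNearestSmaller (left: ('K * 'A)[]) (right: ('K * 'B)[]) =
    let keys =
        Array.append (Array.map fst left) (Array.map fst right)
        |> Array.distinct
        |> Array.sort
    keys
    |> Array.map (fun k -> k, (lookupNearestSmaller left k, lookupNearestSmaller right k))
```

In the two-stock example, a tick on either side produces a row where the other side carries its most recent earlier value, which is what a ratio calculation would need.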

Deedle DataFrame with sliced columns to R conversion exception.

I get an exception when converting a Deedle data frame with sliced columns to R.
This is the script:

#I "..\\packages\\Deedle.0.9.11-beta"
#I "..\\packages\\RProvider.1.0.4"
#load "RProvider.fsx"
#load "Deedle.fsx"

open Deedle
open RDotNet
open RProvider
open RProvider.``base``
open RProvider.datasets

let mtcars : Frame<string, string> = R.mtcars.GetValue()
let mtcars' = mtcars.Columns.[["vs";"am";"gear";"carb"]]
R.as_data_frame(mtcars)  // works
R.as_data_frame(mtcars') // fails

Error message:

System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.Exception: No converter registered for type System.Object[] or any of its base types
   at [email protected](String message) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 102
   at RProvider.RInteropInternal.REngine.SetValue(REngine this, Object value, FSharpOption`1 symbolName) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 212
   at RProvider.RInteropInternal.toR(Object value) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 225
   at RProvider.RInterop.passArg@312(List`1 tempSymbols, Object arg) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 326
   at [email protected](IEnumerable`1& next) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 334
   at Microsoft.FSharp.Core.CompilerServices.GeneratedSequenceBase`1.MoveNextImpl()
   at Microsoft.FSharp.Core.CompilerServices.GeneratedSequenceBase`1.System-Collections-IEnumerator-MoveNext()
   at Microsoft.FSharp.Collections.SeqModule.ToArray[T](IEnumerable`1 source)
   at RProvider.RInterop.callFunc(String packageName, String funcName, IEnumerable`1 argsByName, Object[] varArgs) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 331
   at <StartupCode$Deedle-RProvider-Plugin>.$Exports.RProvider-IConvertToR-1-Convert@21.Deedle-IFrameOperation`1-Invoke[a,b](Frame`2 )
   --- End of inner exception stack trace ---
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor)
   at System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(Object obj, Object[] parameters, Object[] arguments)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at [email protected](a engine, b value) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 98
   at RProvider.RInteropInternal.REngine.SetValue(REngine this, Object value, FSharpOption`1 symbolName) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 212
   at RProvider.RInteropInternal.toR(Object value) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 225
   at RProvider.RInterop.passArg@312(List`1 tempSymbols, Object arg) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 326
   at [email protected](IEnumerable`1& next) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 334
   at Microsoft.FSharp.Core.CompilerServices.GeneratedSequenceBase`1.MoveNextImpl()
   at Microsoft.FSharp.Core.CompilerServices.GeneratedSequenceBase`1.System-Collections-IEnumerator-MoveNext()
   at Microsoft.FSharp.Collections.SeqModule.ToArray[T](IEnumerable`1 source)
   at RProvider.RInterop.callFunc(String packageName, String funcName, IEnumerable`1 argsByName, Object[] varArgs) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 331
   at RProvider.RInterop.call(String packageName, String funcName, String serializedRVal, Object[] namedArgs, Object[] varArgs) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 375
   at <StartupCode$FSI_0008>.$FSI_0008.main@()

Index that avoids duplicates

Add IIndex type that automatically avoids duplicate-key errors (e.g. when appending data frames that have an ordinal index, the index should be re-calculated)

Naming

Don suggests renaming Series.ofObservations to something else (like Series.ofPairs). I think "observations" is a bit too long, so I agree, but I'm not entirely sure what the best name would be. "pair" sounds okay, but maybe not ideal.

Type provider

Add type provider that can be used for creating statically typed data frames.

Expose types of columns

Internally, columns are stored as values of type IVector<V> and the type of V matters sometimes (e.g. when passing data to R provider).

The Print operation should say what the types are and we need some functions to convert those if they are incorrect.

(But also, slicing should preserve these types...)

Frame.ReadCsv throws an exception with web streams

I tried to retrieve data from the web, but I got an error.

open System.Net
open Deedle

let irisDataUri = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
let iris =
   let request = WebRequest.Create (irisDataUri)
   use response = request.GetResponse ()
   use stream = response.GetResponseStream ()
   Frame.ReadCsv (stream, false)

Unhandled Exception: System.NotSupportedException: This stream does not support seek operations.
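
A possible workaround on the caller's side, sketched under the assumption that the reader only needs a seekable stream: buffer the response in memory first. The helper name `toSeekable` is hypothetical, and the approach only suits data that fits in memory.

```fsharp
open System.IO

// Copies any readable stream into a MemoryStream, which supports seeking.
let toSeekable (source: Stream) : MemoryStream =
    let buffer = new MemoryStream()
    source.CopyTo buffer
    buffer.Position <- 0L   // rewind so a reader starts at the beginning
    buffer

// Intended use with the script above (not verified against the library):
//   use seekable = toSeekable stream
//   Frame.ReadCsv (seekable, false)
```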

.NET 4.0 support

This is a great and long awaited library, but could it target .NET 4.0?

It is quite trivial to replace IReadOnlyList by ReadOnlyCollection (done here https://github.com/buybackoff/FSharp.DataFrame/commit/7e0b84c096ab3a27a55fc0658c832555cd65f269, all tests pass).

However there are modules FrameUtils and FrameExtentions that are tightly coupled with FSharp.Data.DesignTime for type inference from TextReader. Then the method ReadCSV is used from tests, but the data supplied is a .csv file. As I understand, runtime FSharp.Data could infer types from sample files, but in FrameUtils the data is supplied as TextReader.

This SO question says one doesn't need the DesignTime reference and could delete it, but that does not work in this case. http://stackoverflow.com/questions/19214044/is-fsharp-data-designtime-net-4-5-only

Probably .CSV parsing utility and extensions should not be a part of the DataFrame itself, but reside in tests or samples? I am quite happy with Frame constructor only and could easily construct columns myself and use the constructor like on the last line in FrameUtils: Frame(rowIndex, columnIndex, Vector.ofValues columns).

IFsiFormattable

Have you considered instead of having the IFsiFormattable interface and then registering an fsi printer for it, just turning the Format method into a property and use [<StructuredFormatDisplay("{Format}")>]?

That way we wouldn't need to have #load "FSharp.DataFrame.fsx", just #r "FSharp.DataFrame.dll".

This would be especially helpful with the new "Send to FSI" command in VS2013, making it a better experience: just reference the dll and be ready to go.

The problem with this is that the Format member would appear in IntelliSense, but it could be hidden from C# by using [<EditorBrowsable(EditorBrowsableState.Never)>] and from F# by using [<CompilerMessage("This method is intended to be used only by FSI Printer", 10002, IsHidden=true, IsError=false)>]

(Another simpler option would be just to use the ToString, which would be even nicer for C# users)
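
For reference, the attribute-based approach looks roughly like this. `Table` is a hypothetical stand-in for a frame or series type; only the `Format` property name comes from the suggestion above.

```fsharp
// With this attribute, %A formatting (which F# Interactive also uses for
// displaying values) shows the Format property instead of the default output.
[<StructuredFormatDisplay("{Format}")>]
type Table(rows: (string * float) list) =
    member this.Format =
        rows
        |> List.map (fun (k, v) -> sprintf "%s -> %g" k v)
        |> String.concat "\n"

printfn "%A" (Table [ "a", 1.0; "b", 2.0 ])
```

No printer registration is involved, which is why referencing the dll alone would be enough.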

Diff data frames/series

Given two series, we want to know how they differ. That is, find which keys are available in one, but not in the other and find the keys for which they both have values, but the values differ.

This would be very useful for interactive exploration - when you get two data frames or two series and want to quickly check how they differ (e.g. when they represent two versions of the same data set).

For example, say we have the following two series:

let s1 = series [ 1 => 1.0; 2 => 2.0; 3 => 3.0 ]
let s2 = series [ 1 => 10.0; 2 => 2.0; 4 => 4.0 ]

The difference could be described using a simple discriminated union, something like this:

type Diff<'T> = 
  | Change of 'T * 'T 
  | Remove of 'T 
  | Add of 'T
  override x.ToString() =
    match x with 
    | Change(a, b) -> sprintf "%A -> %A" a b 
    | Remove v -> sprintf "-%A" v | Add v -> sprintf "+%A" v

And Series.compare a b would return something like this:

series [1 => Change(1.0, 10.0); 3 => Remove 3.0; 4 => Add 4.0 ]
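
The comparison itself is small. The following sketch uses F# `Map` in place of series so that it runs without the library; `compareMaps` is a hypothetical name standing in for the proposed Series.compare, and the `Diff` type is repeated (without `ToString`) to keep the snippet self-contained.

```fsharp
type Diff<'T> =
    | Change of 'T * 'T
    | Remove of 'T
    | Add of 'T

// Walks the union of keys and records only the keys where the two inputs differ.
let compareMaps (a: Map<'K, 'T>) (b: Map<'K, 'T>) : Map<'K, Diff<'T>> =
    let keysOf m = m |> Map.toSeq |> Seq.map fst |> Set.ofSeq
    Set.union (keysOf a) (keysOf b)
    |> Seq.choose (fun k ->
        match Map.tryFind k a, Map.tryFind k b with
        | Some x, Some y when x = y -> None          // same value: no difference
        | Some x, Some y -> Some (k, Change (x, y))
        | Some x, None -> Some (k, Remove x)
        | None, Some y -> Some (k, Add y)
        | None, None -> None)
    |> Map.ofSeq
```

Applied to maps holding the same data as s1 and s2, this gives the three entries listed above: a change at key 1, a removal at key 3 and an addition at key 4.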

Comparing frames could work in a similar way...

Performance of CSV reader

Make sure the CSV reader is fast... (pandas default CSV reader can handle some 10k rows, but more is slow?)

Improve behaviour of `nestBy`

In the current version, the function takes just a projection:

df |> Frame.nestBy fst 

Given Frame<R1 * R2, C>, this produces Series<R1, Frame<R1 * R2, C>> but it would be more reasonable to produce Series<R1, Frame<R2, C>>. To do that, we would have to take a pair of functions rather than fst.
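
The pair-of-functions idea can be shown on a plain `Map`, where the map keys play the role of row keys. This is a sketch of the intended shape only; the signature is an assumption, not an agreed API.

```fsharp
// `outer` picks the key of the resulting series, `inner` picks the row key
// that remains inside each nested group.
let nestRowsBy (outer: 'K -> 'K1) (inner: 'K -> 'K2) (rows: Map<'K, 'V>) : Map<'K1, Map<'K2, 'V>> =
    rows
    |> Map.toSeq
    |> Seq.groupBy (fun (k, _) -> outer k)
    |> Seq.map (fun (k1, group) ->
        k1, group |> Seq.map (fun (k, v) -> inner k, v) |> Map.ofSeq)
    |> Map.ofSeq
```

Calling it with `fst` and `snd` on rows keyed by `R1 * R2` gives groups keyed by R1 whose rows are keyed by R2 alone, which is the Series<R1, Frame<R2, C>> shape described above.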

Also, rename this to nestRowsBy and add nestColsBy.

Frame.sum should not fail

Frame.sum on a frame that contains columns with non-numeric data should not throw. It should return a series with missing values or drop the columns.

R plugin - data frame columns

This commit (ef65df7) tried to fix an error where passing Deedle data frame to R would fail.

However, the problem isn't the size of the data frame, but instead, the column keys - the operation R.data_frame fails when the column names are not valid R identifiers. The $<- operation can handle that, because it takes the name as a string (not as a named param).

We should probably build data frames using $<- unless that is slower.

Documentation small issue

The font used in the docs for the tooltips has a problem: 0 and o are very hard to distinguish (tested both in Chrome and IE on Windows).
For example, in http://bluemountaincapital.github.io/FSharp.DataFrame/tutorial.html, in the second code block, in the first (...), where the tooltip is seq { for i in 0 .. (count - 1) ->
I was honestly confused and looking for the definition of the o variable :)
I suggest the same font from the code is used instead.

Improve testing

Add more tests for series, etc. based on F# interactive scripts.

Generated docs from XML comments

The comments are written in Markdown, so this needs to be transformed first. Then we need to generate nice doc page from it...
