acowley / frames Goto Github PK

Data frames for tabular data.

License: Other

Haskell 96.76% Python 1.12% Shell 0.65% Nix 1.47%

frames's Introduction

Frames

Data Frames for Haskell

User-friendly, type safe, runtime efficient tooling for working with tabular data deserialized from comma-separated values (CSV) files. The type of each row of data is inferred from data, which can then be streamed from disk, or worked with in memory.

We provide streaming and in-memory interfaces for efficiently working with datasets that can be safely indexed by column names found in the data files themselves. This type safety of column access and manipulation is checked at compile time.

Use Cases

For a running example, we will use variations of the prestige.csv data set. Each row includes 7 columns, but we just want to compute the average ratio of income to prestige.

Clean Data

If you have a CSV data where the values of each column may be classified by a single type, and ideally you have a header row giving each column a name, you may simply want to avoid writing out the Haskell type corresponding to each row. Frames provides TemplateHaskell machinery to infer a Haskell type for each row of your data set, thus preventing the situation where your code quietly diverges from your data.

We generate a collection of definitions generated by inspecting the data file at compile time (using tableTypes), then, at runtime, load that data into column-oriented storage in memory with a row-oriented interface (an in-core array of structures (AoS)). We're going to compute the average ratio of two columns, so we'll use the foldl library. Our fold will project the columns we want, and apply a function that divides one by the other after appropriate numeric type conversions. Here is the entirety of that program.

{-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell, TypeApplications #-}
module UncurryFold where
import qualified Control.Foldl                 as L
import           Data.Vinyl.Curry               ( runcurryX )
import           Frames

-- Data set from http://vincentarelbundock.github.io/Rdatasets/datasets.html
tableTypes "Row" "test/data/prestige.csv"

loadRows :: IO (Frame Row)
loadRows = inCoreAoS (readTable "test/data/prestige.csv")

-- | Compute the ratio of income to prestige for a record containing
-- only those fields.
ratio :: Record '[Income, Prestige] -> Double
ratio = runcurryX (\i p -> fromIntegral i / p)

averageRatio :: IO Double
averageRatio = L.fold (L.premap (ratio . rcast) avg) <$> loadRows
  where avg = (/) <$> L.sum <*> L.genericLength

Missing Header Row

Now consider a case where our data file lacks a header row (I deleted the first row from `prestige.csv`). We will provide our own name for the generated row type, our own column names, and, for the sake of demonstration, we will also specify a prefix to be added to every column-based identifier (particularly useful if the column names do come from a header row, and you want to work with multiple CSV files some of whose column names coincide). We customize behavior by updating whichever fields of the record produced by rowGen we care to change, passing the result to tableTypes'. Link to code.

{-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell, TypeApplications #-}
module UncurryFoldNoHeader where
import qualified Control.Foldl                 as L
import           Data.Vinyl.Curry               ( runcurryX )
import           Frames
import           Frames.TH                      ( rowGen
                                                , RowGen(..)
                                                )

-- Data set from http://vincentarelbundock.github.io/Rdatasets/datasets.html
tableTypes' (rowGen "test/data/prestigeNoHeader.csv")
            { rowTypeName = "NoH"
            , columnNames = [ "Job", "Schooling", "Money", "Females"
                            , "Respect", "Census", "Category" ]
            , tablePrefix = "NoHead"}

loadRows :: IO (Frame NoH)
loadRows = inCoreAoS (readTableOpt noHParser "test/data/prestigeNoHeader.csv")

-- | Compute the ratio of money to respect for a record containing
-- only those fields.
ratio :: Record '[NoHeadMoney, NoHeadRespect] -> Double
ratio = runcurryX (\m r -> fromIntegral m / r)

averageRatio :: IO Double
averageRatio = L.fold (L.premap (ratio . rcast) avg) <$> loadRows
  where avg = (/) <$> L.sum <*> L.genericLength

Missing Data

Sometimes not every row has a value for every column. I went ahead and blanked the prestige column of every row whose type column was NA in prestige.csv. For example, the first such row now reads,

"athletes",11.44,8206,8.13,,3373,NA

We can no longer parse a Double for that row, so we will work with row types parameterized by a Maybe type constructor. We are substantially filtering our data, so we will perform this operation in a streaming fashion without ever loading the entire table into memory. Our process will be to check if the prestige column was parsed, only keeping those rows for which it was not, then project the income column from those rows, and finally throw away Nothing elements. Link to code.

{-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell, TypeApplications, TypeOperators #-}
module UncurryFoldPartialData where
import qualified Control.Foldl as L
import Data.Maybe (isNothing)
import Data.Vinyl.XRec (toHKD)
import Frames
import Pipes (Producer, (>->))
import qualified Pipes.Prelude as P

-- Data set from http://vincentarelbundock.github.io/Rdatasets/datasets.html
-- The prestige column has been left blank for rows whose "type" is
-- listed as "NA".
tableTypes "Row" "test/data/prestigePartial.csv"

-- | A pipes 'Producer' of our 'Row' type with a column functor of
-- 'Maybe'. That is, each element of each row may have failed to parse
-- from the CSV file.
maybeRows :: MonadSafe m => Producer (Rec (Maybe :. ElField) (RecordColumns Row)) m ()
maybeRows = readTableMaybe "test/data/prestigePartial.csv"

-- | Return the number of rows with unknown prestige, and the average
-- income of those rows.
incomeOfUnknownPrestige :: IO (Int, Double)
incomeOfUnknownPrestige =
  runSafeEffect . L.purely P.fold avg $
    maybeRows >-> P.filter prestigeUnknown >-> P.map getIncome >-> P.concat
  where avg = (\s l -> (l, s / fromIntegral l)) <$> L.sum <*> L.length
        getIncome = fmap fromIntegral . toHKD . rget @Income
        prestigeUnknown :: Rec (Maybe :. ElField) (RecordColumns Row) -> Bool
        prestigeUnknown = isNothing . toHKD . rget @Prestige

Tutorial

For comparison to working with data frames in other languages, see the tutorial.

Demos

There are various demos in the repository. Be sure to run the getdata build target to download the data files used by the demos! You can also download the data files manually and put them in a data directory in the directory from which you will be running the executables.

Contribute

You can build Frames via nix with the following command:

nix build .#Frames-8107  # or nix build .#Frames-921

this creates an ./result link in the current folder.

To get a development shell with all libraries, you can run:

nix develop .#Frames-921

To get just ghc and cabal in your shell, a simple nix develop will do.

Benchmarks

The benchmark shows several ways of dealing with data when you want to perform multiple traversals.

Another demo shows how to fuse multiple passes into one so that the full data set is never resident in memory. A Pandas version of a similar program is also provided for comparison.

This is a trivial program, but shows that performance is comparable to Pandas, and the memory savings of a compiled program are substantial.

First with Pandas,

$ nix-shell -p 'python3.withPackages (p: [p.pandas])' --run '$(which time) -f "%Uuser %Ssystem %Eelapsed %PCPU; %Mmaxresident KB" python benchmarks/panda.py'
28.087476512228815
-81.90356506136422
0.67user 0.04system 0:00.72elapsed 99%CPU; 79376maxresident KB

Then with Frames,

$ $(which time) -f '%Uuser %Ssystem %Eelapsed %PCPU; %Mmaxresident KB' dist-newstyle/build/x86_64-linux/ghc-8.10.4/Frames-0.7.2/x/benchdemo/build/benchdemo/benchdemo
28.087476512228815
-81.90356506136422
0.36user 0.00system 0:00.37elapsed 100%CPU; 5088maxresident KB

frames's People

Contributors

Stargazers

Watchers

Forkers

houshuang clhodapp qrilka commandodev dougburke cje meiersi noinia drwebb smunix dalejordan dmatysiak dbihbka andrewthad schernichkin sid-kap shapr gdevanla yuhangwang pseudonom codygman dboyliao hodapp87 treeowl pepeiborra axman6 dmvianna cfhammill kaitanie o1lo01ol1o gwils mmesch vertexcite stewarthu andrewdmeier ajchapman adamconnersax teto ulysses4ever idontgetoutmuch freefrancisco

frames's Issues

Dealing with missing values

Great library, and serving as my introduction to Vinyl. I want to parse some CSV files with missing values, so I'm dealing with Maybe types which I want to convert into a fully populated type by using some default values kind of like so.

type MaybeUser = Rec Maybe '[Occupation, ... ] 
type User = Record '[Occupation, ... ]

instance Default User where
  def = def &: ... &: RNil -- Nice to find a better way to write this

-- This doesn't work because obviously can't infer x is a instance of class default
fromUserMaybe :: MaybeUser -> User
fromUserMaybe = rmap (\x -> Identity $ fromMaybe default x) 

-- This doesn't work as well
fromUserMaybe' :: MaybeUser -> User
fromUserMaybe' = mapMethod [pr|Default|] (\x -> Identity $ fromMaybe default x)

As I hope is clear I want to set up a pipe lines sort of like

userProd :: Producer User IO () 
userProd = readTableMaybeOpt userParser "File.csv" >-> P.map fromUserMaybe

Thank your work on this library, it's been very cool to experiment with. This does seem like a really common case though, and hopefully you can add some functionality to cover this.

what about hdf5 ?

Hello,

do you plan to add the hdf5 files support for frames ?

Quoted Doubles over 999 containing comma don't parse

E.g. "1,000,000","1,055.16"

The one million parses, 1055.16 does not. I hope to look at it after dinner but wanted to submit this issue in case I cannot.

Handle missing data with Bookkeeper

From former experience with Frames, I recall having to wrap almost all of my column types with Maybe to handle missing data. I was wondering if something like the recently announced bookkeeper library might be able to address this issue. If you think it might be worth looking into, I'd love to take a shot at prototyping a Bookkeeper-backed Frames PR.

Showable instance for CoRec ColInfo ts to debug bestRep

It would be quite nice to give some text to bestRep and get back a showable representation. I'm trying to figure out how to pass some predefined types to bestRep right now as part of #66 and this would be pretty useful for me I think.

Thoughts @acowley ?

Newest version of readable has removed instance for Text

mightybyte/readable@bfc1074

This seems to the latest version on hackage. I'll try and submit a patch in the next day or so.

readTable function for finding rows that don't parse

Something that returns (or when consumed returns) the rows parsed successfully and the text lines that failed to parse.

Maybe the type could be:

someReadTableFunction :: readTableOpt' :: forall m rs.
                 (MonadPlus m, MonadIO m, ReadRec rs)
=> ParserOptions -> FilePath -> m (T.Text,Record rs)

I seem to be running into unmanaged dependencies.

I'm trying to go through the tutorial, line by line, that is linked here: https://github.com/acowley/Frames/blob/master/demo/Tutorial.hs

I've run cabal configure, build, and install in the main directory, after which the cabal repl has no problem compiling Frames:

pwnmachine $ cabal repl
Preprocessing library Frames-0.1.0.0...
GHCi, version 7.8.3: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
[...]
Loading package pipes-4.1.4 ... linking ... done.
[ 1 of 13] Compiling Frames.TypeLevel ( src/Frames/TypeLevel.hs, interpreted )
[ 2 of 13] Compiling Frames.CoRec     ( src/Frames/CoRec.hs, interpreted )
[...]
[13 of 13] Compiling Frames           ( src/Frames.hs, interpreted )
Ok, modules loaded: Frames, Frames.Col, Frames.ColumnTypeable, Frames.ColumnUniverse, Frames.CoRec, Frames.CSV, Frames.Exploration, Frames.Frame, Frames.InCore, Frames.Rec, Frames.RecF, Frames.RecLens, Frames.TypeLevel.
*Frames>

Unfortunately, there are a number of modules mentioned in the Demo that seem not to be included as dependencies. For example, I run into:

*Frames> import qualified Control.Foldl as L
 <no location info>:
     Could not find module ‘Control.Foldl’
     It is not a module in the current program, or in any known package.
 *Frames> import Lens.Family
 <no location info>:
    Could not find module ‘Lens.Family’
    It is not a module in the current program, or in any known package.

Please let me know if you need any other information on my installation or process. It's quite possible that there's something that I'm not doing right.

At any rate, I'm really excited to see this library and can't wait to start using it. Frames itself seems to work fine. It's just that I'm not able to run the tutorial because it seems to have unlisted dependencies.

Multiple custom column types

I have two custom column types

newtype NumericalAnswer = NumericalAnswer (Maybe Int) deriving (Show,Read,Eq,Ord)
newtype SkillLevel = SkillLevel (Maybe Text) deriving (Show,Read,Eq,Ord)

instance Readable NumericalAnswer where
  fromText t
    | t == "0" = return $ NumericalAnswer $ Just 0
    | t == "1" = return $ NumericalAnswer $ Just 1
    | t == "2" = return $ NumericalAnswer $ Just 2
    | t == "3" = return $ NumericalAnswer $ Just 3
    | t == "4" = return $ NumericalAnswer $ Just 4
    | t == "\"\"" || t == "Don't know" = return $ NumericalAnswer Nothing
    | otherwise = mzero

instance Readable SkillLevel where
  fromText t
    | t `elem` ["none","beginner","learning","competent","expert"] = return $ SkillLevel (Just t)
    | t == "\"\"" = return $ SkillLevel Nothing
    | otherwise = mzero

When I add only one of them to MyColumns and call

tableTypes'
  rowGen {rowTypeName = "Answer"
         ,columnUniverse = $(colQ ''MyColumns)}
  "data/raw-anon-v1-0.csv"

exactly those columns I want end up being converted to the added type. However if I define

type MyColumns =  SkillLevel ': NumericalAnswer ': CommonColumns

the columns that should convert to NumericalAnswer (and do so if I don't add SkillLevel) stay Text.
It's probably me doing something terribly wrong, but I can't quite figure out what.

Use types default value instead of not parsing row

col_a,col_b
"0","x"
"2","x"
"1",x"
,"x"
"0","x"

The current behavior returns:

{col_a :-> False, col_b :-> "x"}
{col_a :-> True, col_b :-> "x"}
{col_a :-> False, col_b :-> "x"}

I'd like a way to get:

{col_a :-> 0, col_b :-> "x"}
{col_a :-> 2, col_b :-> "x"}
{col_a :-> 1, col_b :-> "x"}
{col_a :-> 0, col_b :-> "x"} -- this one was blank and defaulted according to type int
{col_a :-> 0, col_b :-> "x"}

I tried modifying coRecToRec to use mempty instead of returning a maybe value but to be honest I'm just barely understanding how it works and it's quite late, so off to sleep ;)

Can't run demos

Maybe a duplicate of #23, except that i can follow the steps and install everything, but i get an error when trying to run the demos:

 Frames (master)*$ cabal run plot
Preprocessing library Frames-0.1.0.0...
In-place registering Frames-0.1.0.0...
Preprocessing executable 'plot' for Frames-0.1.0.0...

demo/Plot.hs:4:8:
    Could not find module ‘Diagrams’
    Use -v to see a list of the files searched for.
 Frames (master)*$

declareColumn type should override tableTypes

Code

{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TemplateHaskell #-}
module TestDouble where

import Frames
import Frames.CSV
import Data.Vinyl
import Control.Monad.IO.Class
import qualified Pipes as P
import qualified Pipes.Prelude as P

declareColumn "col_c" ''Double
tableTypes' rowGen "test.csv"

expected type

type Row =
  Record '["col_a" :-> Int, "col_b" :-> Int, "col_c" :-> Double]

actual type after attempting (but probably failing to ensure) column types aren't created if existing

type Row =
  Record '["col_a" :-> Int, "col_b" :-> Int, "col_c" :-> Bool]

Consider using Cassava for csv parsing

I'm sure this was something you were going to consider eventually, but thought I'd get the ball rolling since I ran into some problems trying to play with this dataset:

https://data.texas.gov/Business-and-Economy/DIR-Cooperative-Contract-Sales-Data-Fiscal-2010-to/w4x7-gfja

I believe the handling of csvs chokes on newlines not meaning "one row" in Frames current csv parsing scheme, but I tried it out in Cassava and everything parsed fine.

I thought of two different approaches:

Make a version of cassava that works with frames and update functions like decodeWithP.
Generate Named Record Instances and parse as is already done in cassava but then output in a Vinyl record.

EDIT: Perhaps we can be really lazy and just use the record parser with the current Frames csv implementation.

frameUncons not exported

Did you mean to leave this function off the export list?

It seems to be more general than recUncons which is exported.

The naming of frameCons, frameConsA, frameUncons and frameSnoc seems confusing since they operate at the Rec level and have nothing to do with Frames (the type, not the package).

CoRec as a standalone library?

Any plans to export this CoRec structure to its own library? or perhaps as part of vinyl? I'm interested in trying to use this for the upcoming vinyl rework of webdriver, but I'd like to only depend on a library specifically tailored to CoRec.

Add Cassandra support

I've been working on this a little (hopefully more soon) and thought I'd put up an issue to keep track of it and ask questions.

cabal: Cannot build the executable 'getdata' because the component is marked as disabled in the .cabal file.

I am checking out Frames and I'm trying the command
'cabal run getdata'. When I do so I get the error "cabal: Cannot build the executable 'getdata' because the component is marked as disabled in the .cabal file." I opened up Frames.cabal and noticed that it's some flag. I am new to Haskell so I'm not sure how to set the flag for demos to true.

I've tried cabal run -f demos getdata.

Thanks.

merging csvs

In R, I have this

csvs = lapply(list_of_files, read.csv)
r=Reduce(function(x,y) {merge(x,y, by="key",all=TRUE)}, csvs[-1], csvs[[1]])
write.csv(r, file="")

Roughly transliterated into Haskell and ignoring the IO, I would want something like

type CSV = [[(String,Maybe String)]]
merge :: String -> NEList CSV -> CSV

the new headers would be the concatenation (with renaming upon collision) of all the given headers.
as an example:

merge [  [[("key", Just "foo"), ("data1", Just "bar")]]
              ,[("key", Just "baz"), ("data1", Just "quux")]
              ]
           ,  [[("key", Just "foo"),("data2", Just "x")]]
           ]

should give

[[("key", Just "foo"), ("data1", Just "bar"), ("data2", Just "x")]
,[("key", Just "baz"), ("data1", Just "quux"), ("data2", Nothing)]
]

obviously this is a dreadfully inefficient representation but I hope the semantics are clear.

Improve Double inference

CSV

col_a,col_b,col_c
1,9,"1,000,000.00"
2,8,"300,000.00"
3,7,"0"
4,6,0

Code

{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TemplateHaskell #-}
module TestDouble where

import Frames
import Frames.CSV
import Data.Vinyl
import Control.Monad.IO.Class
import qualified Pipes as P
import qualified Pipes.Prelude as P

tableTypes' rowGen "test.csv"

expected type

type Row =
  Record '["col_a" :-> Int, "col_b" :-> Int, "col_c" :-> Double]

actual type

type Row =
  Record '["col_a" :-> Int, "col_b" :-> Int, "col_c" :-> Bool]

shell.nix doesn't seem to work

[cody@cody-nixos:~/sources/Frames]$ nix-shell --pure shell.nix
error: attempt to call something which is not a function but a set, at "/home/cody/sources/Frames/shell.nix":2:7

shell.nix

1 with (import <nixpkgs> {}).pkgs;
2 with (import <nixpkgs/pkgs/development/haskell-modules/lib.nix> { pkgs = import <nixpkgs> {}; });
3 let
4     hsPkgs = haskellngPackages.override {
5       overrides = self: super: {
6         Rasterific = dontCheck (super.Rasterific);
7         readable = overrideCabal (super.readable.overrideScope (self: super: { }))(drv: {
8            version = "0.2.0.2";
9            sha256 = "0dv1xr4y5azcr8xnhsl7i8ab56mkq7b89x55c2rg9kfakmgxiqcl";
0            });
1       };
2     };
3     pkg = hsPkgs.callPackage
4             ({ mkDerivation, base, ghc-prim, pipes, readable, stdenv
5              , template-haskell, text, transformers, vector, vinyl
6              , cairo, diagrams, diagrams-rasterific, Chart, Chart-diagrams
7              , lens, lens-family, foldl, list-t, http-client, statistics, zip-archive
8              }:
9              mkDerivation {
0                pname = "Frames";
1                version = "0.1.0.0";
2                src = ./.;
3                isLibrary = true;
4                isExecutable = true;
5                buildDepends = [
6                  base ghc-prim pipes readable template-haskell text transformers
7                  vector vinyl
8                  cairo diagrams diagrams-rasterific Chart Chart-diagrams
9                  lens lens-family foldl list-t http-client statistics zip-archive
0                ];
1                description = "Data frames For working with tabular data files";
2                license = stdenv.lib.licenses.bsd3;
3              }) {};
4 in
5   pkg.env

Maybe this is meant to be used with a different version of nix?

Benchmarks fail to compile on GHC 7.10.3

When I try to build the benchmarks from Hackage, it errors with:

Preprocessing benchmark 'insurance' for Frames-0.1.2.1...
[1 of 1] Compiling Main             ( benchmarks/InsuranceBench.hs, .stack-work/dist/x86_64-linux/Cabal-1.22.5.0/build/insurance/insurance-tmp/Main.o )

/nfs/home/crest-team/jenkins_workspaces/cutter/workspace/benchmark_stackage/label/CREST_cutter/.bench-build/lts-5.16/Frames-0.1.2.1/benchmarks/InsuranceBench.hs:1:1:
    Exception when trying to run compile-time code:
      data/FL2.csv: openFile: does not exist (No such file or directory)
    Code: tableTypes "Ins" "data/FL2.csv"

MissingData has no main method

The missing executable specifies the MissingData module as main-is. However that module doesn't contain a main method. How is this supposed to work?

mapMethod for categorical value

When i want scale qualitative value i can use mapMono or mapMethod for do it.
What i must use for encode categorical value?

I can not use mapMono or mapMethod because i have to change Filed type but with mapMono or mapMethod i can use only (a -> a) functions.

mapPoly :: (RElem r rs (RIndex r rs), CanDelete r rs) 
               => (forall f. Functor f => (a -> f a) -> Record rs -> f (Record rs))
               -> Record rs 
               -> proxy r 
               -> (a -> b) 
               -> Record (s :-> b ': RDelete r rs)
mapPoly c r p f = frameCons (pure (f (rget c r))) (rdel p r)

I wrote function to solve that problem, but maybe someone knows more beautiful solution?

Consider adding an instance of Literal for Frames for use with HaskellR

It would be nice as a stop gap while we wait for ggplot to make its way to Haskell for the Frames library to easily be spliced into R and vice versa using HaskellR for easy data frame transference. This should be fairly straightforward as outlined here:

http://tweag.github.io/HaskellR/docs/splicing-haskell-values-in-r.html

Stack overflow compiling large Record (100+ columns) with Ghc 8.0.1

I just wanted to put here as a note that I got a stack overflow. error compiling a rather large vinyl record with GHC 8.0.1.

I think it is fixed in GHC 8.0.2 because of this bugfix. Unconfirmed, but hope o post back here. Wanted to record this in case any other users run into a similar issue. A workaround for me in the meantime was to compile with -O0.

Plot demo doesn't work (even after getdata succeeds).

It seems to be failing on the nonexistence of a local file that it expects the getdata task to generate. The error is this:

pwnmachine $ cabal run plot
Preprocessing library Frames-0.1.0.0...
In-place registering Frames-0.1.0.0...
Preprocessing executable 'plot' for Frames-0.1.0.0...
[1 of 1] Compiling Main             ( demo/Plot.hs, dist/build/plot/plot-tmp/Main.o )
Loading package ghc-prim ... linking ... done.
[...]
Loading package Chart-diagrams-1.3.3 ... linking ... done.

demo/Plot.hs:1:1:
    Exception when trying to run compile-time code:
      data/trigly_d.csv: openFile: does not exist (No such file or directory)
    Code: tableTypes "Trigly" "data/trigly_d.csv"

tableTypes should create Pipes Producer

In my workflow I constantly have:

tableTypes "User" "data/ml-100k/u.user"

movieStream :: Producer User IO ()
movieStream = readTableOpt userParser "data/ml-100k/u.user"

I'd like to just shorten this to:

tableTypes "User" "data/ml-100k/u.user"

And have movieStream created for me automatically. I don't typically have a reason for my table types filepath and movieStream filepath to differ, though I acknowledge this could be a valid use case.

Perhaps we could have a default of using the same filepath in both places and override it, e.g.:

default:

tableTypes "User" "data/ml-100k/u.user"

override:

tableTypes "User" "data/ml-100k/u.user" "someotherplace"

Or perhaps something more along the lines of overriding the filepath to stream from with RowGen or a similar Record.

Bug in instance Monoid Frame

Instead of:

 Frame l1 f1 `mappend` Frame l2 f2 = Frame (l1+l2) $ \i ->
                                      if i < l1 then f1 i else f2 i

I think you want:

 Frame l1 f1 `mappend` Frame l2 f2 = Frame (l1+l2) $ \i ->
                                      if i < l1 then f1 i else f2 (i - l1)

over

Hello,
Thanks to rget and rput, we don't need to import view and set from the Lens.Family module. The function over from Lens.Family is useful too. Is there an equivalent in Frames? Otherwise rover would be nice.

Add LocalTime to CommonColumns

I put significant thought into how to make this General enough to be useful for most, but can't post in full right now. Have you thought through adding support for fuzzy parsing date like text objects? And things like time zone selection?

How can you manually build a Record/'Row'?

type Row = Record ["col_a" :-> Text]

-- I figured out how to make something close
buildRow :: Record '[ColA]
buildRow = pure (Col "some text") :& Nil

How can I make a buildRow function of type Row? Perhaps with ColFun? Perhaps something totally different? Something is telling me I could do this by distributing Identity across the record with ColFun.

Machines?

Given your work on machines, I'm curious why you opted to use pipes for this project

Adding a derived column

Is it possible to add a derived columns to the frame? I see that there is a frameCons that can add a column to the frame but I am not sure how to go about defining a type signature for the new Column as all the other types where generated by TH and I have no idea how that works.

For example I have a table with number of girls and boys enrolled each year at primary school and I want to add a derived column for the total number of kids enrolled each year.

Adding a derived column is a fairly common operation in R so being able to duplicate it easily would be a plus.

Thanks
Riaan

A User Report

I am not sure this is an issue but I thought I would report back on my experience with a 100M file with 24,001 columns.

tableTypes "User" "clouds.csv" takes a long time 5+ minutes.
I didn't dare do :r but just did let pfStream :: Producer User IO (); pfStream = readTableOpt userParser "clouds.csv" at the ghci prompt. This gave Context reduction stack overflow; size = 101 with an error message of over 24,000 lines!
At this point, I am heading back to R :-(

Let me know if you want any more details.

More examples

I would love some more examples, as I struggle to understand the types and the setup here, for example:

how do I request a row by number/id
how do I request a row where col A conforms to some predicate
etc.

Here are some of the things I listed on sdiehl/frame:

selecting rows based on a single column predicate (the equivalent of
db[db$age > 30,] in R)

selecting rows based on multiple column predicates (the equivalent of
db[db$age > 30 && db$weight >50,])

creating a new hframe where a row has been modified by a function
(equivalent of db$age = db$age * 2, but functionally) - or something like
newframe
= fmap (* 2) (frame ! "age")

creating a new hframe with an added column calculated based on one or
more existing columns (the equivalent of db$derived = db$age / db$weight,
but functionally)

an example of groupBy

however he had already provided examples of the basics above (getting a column, transforming it by a function etc).

The current examples are either very simple (printing a table), or fairly involved.

Anything at all would be helpful, thanks!

tableTypes should conditionally create type synonym

Say I have 2 csvs with these headers:

manager,age,pay

employee,age,pay

tableTypes will fail if I try to use it for both of them.

Could tableTypes conditionally create type synonyms here to make this work?

Context reduction stack overflow

Trying to reproduce one of the examples with my own data. Downloaded the CSV file here: https://raw.github.com/amercader/car-fuel-and-emissions/master/data.csv

Script

{-# LANGUAGE DataKinds, FlexibleContexts, TemplateHaskell #-}
import Frames
import qualified ListT as L

tableTypes "Ins" "/Users/stian/src/Frames/src/Frames/data.csv" 

listTlist :: Monad m => L.ListT m a -> m [a]
listTlist = L.toList

tbl :: IO [Ins]
tbl = listTlist $ readTable' "/Users/stian/src/Frames/src/Frames/data.csv" 

main = tbl >>= mapM_ print```

Fails on compilation

test.hs:11:19:
    Context reduction stack overflow; size = 21
    Use -fcontext-stack=N to increase stack size to N
      Frames.CSV.ReadRec
        '["nox_emissions" :-> Text, "thc_nox_emissions" :-> Text,
          "particulates_emissions" :-> Text,
          "fuel_cost_12000_miles" :-> Text, "fuel_cost_6000_miles" :-> Text,
          "standard_12_months" :-> Text, "standard_6_months" :-> Text,
          "first_year_12_months" :-> Text, "first_year_6_months" :-> Text,
          "date_of_change" :-> Text]
    In the second argument of ‘($)’, namely
      ‘readTable' "data/data1.csv"’
    In the expression: listTlist $ readTable' "data/data1.csv"
    In an equation for ‘tbl’:
        tbl = listTlist $ readTable' "data/data1.csv"

The generated types:

type Ins = Rec '["file" :-> Text, "year" :-> Int, "manufacturer" :-> Text, "model" :-> Text, "description" :-> Text, "euro_standard" :-> Text, "tax_band" :-> Text, "transmission" :-> Text, "transmission_type" :-> Text, "engine_capacity" :-> Text, "fuel_type" :-> Text, "urban_metric" :-> Text, "extra_urban_metric" :-> Text, "combined_metric" :-> Text, "urban_imperial" :-> Text, "extra_urban_imperial" :-> Double, "combined_imperial" :-> Double, "noise_level" :-> Double, "co2" :-> Double, "thc_emissions" :-> Text, "co_emissions" :-> Text, "nox_emissions" :-> Text, "thc_nox_emissions" :-> Text, "particulates_emissions" :-> Text, "fuel_cost_12000_miles" :-> Text, "fuel_cost_6000_miles" :-> Text, "standard_12_months" :-> Text, "standard_6_months" :-> Text, "first_year_12_months" :-> Text, "first_year_6_months" :-> Text, "date_of_change" :-> Text] -- Defined at :1:1

It is indeed a problem with the number of fields, it reliably works with less than 21 fields, and if I use ghc test.hs -fcontext-stack=100, my problem above works as well. Perhaps just something to note in the docs, I'm not sure if there is any way to code around it.

CSV parsing doesn't support quoting of separator like RFC4180 specifies

E.g. the row:

foo,"bar,baz",qux

will be parsed as ["foo", "\"bar", "baz\"", "qux"]

Can't install demos

I tried to run cabal install --dependencies-only -f demos in a new sandbox, and cabal threw an error. I think this is because of the recent update of diagrams to version 1.3. I think this could be fixed by specifying version ranges for each dependency in the demos, but I am not sure what versions to use.

Here is the output of the command:

$ cabal install --dependencies-only -f demos

Resolving dependencies...
cabal: Could not resolve dependencies:
trying: Frames-0.1.0.0 (user goal)
rejecting: Frames-0.1.0.0:-demos (global constraint requires opposite flag
selection)
trying: Frames-0.1.0.0:+demos
trying: diagrams-lib-1.3 (dependency of Frames-0.1.0.0:+demos)
trying: diagrams-core-1.3 (dependency of diagrams-lib-1.3)
next goal: Chart-diagrams (dependency of Frames-0.1.0.0:+demos)
rejecting: Chart-diagrams-1.4, 1.3.3, 1.3.2, 1.3.1, 1.3, 1.2.4, 1.2.3
(conflict: diagrams-core==1.3, Chart-diagrams => diagrams-core>=1.2 && <1.3)
rejecting: Chart-diagrams-1.2.2 (conflict: diagrams-core==1.3, Chart-diagrams
=> diagrams-core>=1.1 && <1.2)
rejecting: Chart-diagrams-1.2 (conflict: diagrams-core==1.3, Chart-diagrams =>
diagrams-core>=0.7 && <1.2)
rejecting: Chart-diagrams-1.1 (conflict: diagrams-core==1.3, Chart-diagrams =>
diagrams-core>=0.7 && <0.8)
rejecting: Chart-diagrams-1.0 (conflict: diagrams-core==1.3, Chart-diagrams =>
diagrams-core>=0.7 && <1.0)
Dependency tree exhaustively searched.

Note: when using a sandbox, all packages are required to have consistent
dependencies. Try reinstalling/unregistering the offending packages or
recreating the sandbox.

CoRec as extendible open Union

I'm (planning on) using Frames' CoRecs as an extendible open union. My use case also involves pattern matching on a CoRec. I've written a few simple functions that allow me to pattern match on a coRec with something like:

foo = match myCoRec $ 
            H (\NoIntersection -> "No Intersection")
         :& H (\(x :: Int)     -> show x)
         :& RNil

See this gist for details (the gist also defines a 'asA' function to easily extract a Maybe value from a CoRec). Adding something to Frames.CoRec might be useful to others as well.

Ideas for improvements on how to do this are welcome as well of course :).

Stackage?

Do you have any plans to add Frames to Stackage?
I'm very new to this, but I would be willing to help if there's something I could do to help.

I believe it might be as easy as forking github.com/fpgo/stackage and adding the following to build-constraints.yaml:

"Anthony Cowley [email protected] @acowley":
    - Frames

I don't mind making the PR to fpgo/stackage, if you give me permission.

By the way, thanks for all your hard work on Frames! I've really enjoyed using it so far.

Any plans to publish in on hackage?

And any further plans?

What about datetime support?

E.g. I do some analysis including dates using R but switching to something like Frames would be cool.
BTW do you see any good way to do plots using Frames and probably some other package from hackage?

Strictness of value in CoRec

The Rec type in vinyl is strict both in the value and the tail. I think that is a good decision, as it simplifies reasoning about the space usage of Recs. Do you have a particular reason why you make the value in a CoRec non-strict?

(PS: Thanks for this library. It's sources are a very interesting and enlightening read!)

How to define a new type of row ?

Hello,
In order to define a new record, I firstly define new types and lenses by hand:

import qualified Data.Vinyl.TypeLevel as DVT
import qualified Data.Vinyl as DV
import qualified Data.Vinyl.Functor as DVF 

type Foo = "foo" :-> Int
foo :: (Functor f0, RElem Foo rs0 (DVT.RIndex Foo rs0))  => (Int -> f0 Int) -> Record rs0 -> f0 (Record rs0)
foo = rlens (Proxy :: Proxy Foo)
foo' :: (Functor f0, Functor g0, RElem Foo rs0 (DVT.RIndex Foo rs0)) => (g0 Foo -> f0 (g0 Foo)) -> DV.Rec g0 rs0 -> f0 (DV.Rec g0 rs0)
foo' = rlens' (Proxy :: Proxy Foo)

type Baz = "baz" :-> Float
...

Then I define a record like this:

myRec :: Record '[Baz, Foo]
myRec = DVF.Identity 1.5 DV.:& DVF.Identity 3 DV.:& DV.RNil

Is there a more straightforward way? I believe there is one with RowGen but I don't understand how to use that, especially the last field (I'm not an eminent Haskell user).

Bool inference precidence seems too high

CSV

col_a
"0"
"0"
"0"
"0"
A

Haskell

{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TemplateHaskell #-}
module TestIt where

import Frames
import Frames.CSV
import Pipes
import qualified Pipes.Prelude as P

tableTypes' rowGen "testwrongboolinference.txt"

rowStream :: Producer Row IO ()
rowStream = readTableOpt rowParser "testwrongboolinference.txt"

Actual Type

type Row = Record '["col_a" :-> Bool]

Expected Type

type Row = Record '["col_a" :-> Text]

Since the type is inferred as Bool the last column won't be parsed successfully in the current behavior.

What do I do with compressed files?

If I'm reading this right, functions like tableTypes and readTableOpt expect uncompressed text files. What if I have a .csv.gz or .csv.xz?

Either a b -> CoRec Identity [a,b] ?

I'm trying to write a function with the following type, which I seem unable to do:

fromEither :: Either a b -> CoRec Identity [a,b]

The following trivial implementation works, but I cannot get rid of the class constraint that b is an element of [a,b]:

fromEither           :: b  ∈ [a,b] => Either a b -> CoRec Identity [a,b]
fromEither (Left x)  = Col . Identity $ x
fromEither (Right x) = Col . Identity $ x

After fiddling a bit, it seems GHC has issues proving that.

RIndex b [a,b] ~ (S Z)

When we add that as a class constraint. GHC is clever enough to figure out the details. Also note that for concrete types a and b it can also figure out the details. However, for the polymorphic case it cannot. Any ideas on how to convince GHC to convert an Either to a CoRec without the class constraint?

compile failure with GHC 7.8

Frames's .cabal-file claims to be compatible with pre-AMP base-4.7, but it seems not to be:

[ 1 of 14] Compiling Frames.ColumnTypeable ( src/Frames/ColumnTypeable.hs, dist/dist-sandbox-daaa14a/build/Frames/ColumnTypeable.o )

src/Frames/ColumnTypeable.hs:24:11:
    Could not deduce (Functor m) arising from a use of ‘fmap’
    from the context (Parseable a)
      bound by the class declaration for ‘Parseable’
      at src/Frames/ColumnTypeable.hs:(16,1)-(25,22)
    or from (Readable a, MonadPlus m)
      bound by the type signature for
                 parse :: (Readable a, MonadPlus m) => T.Text -> m (Parsed a)
      at src/Frames/ColumnTypeable.hs:21:12-48
    Possible fix:
      add (Functor m) to the context of
        the type signature for
          parse :: (Readable a, MonadPlus m) => T.Text -> m (Parsed a)
        or the class declaration for ‘Parseable’
    In the first argument of ‘(.)’, namely ‘fmap Definitely’
    In the expression: fmap Definitely . fromText
    In an equation for ‘parse’: parse = fmap Definitely . fromText

src/Frames/ColumnTypeable.hs:35:10:
    Could not deduce (Functor m) arising from a use of ‘fmap’
    from the context (MonadPlus m, Parseable a)
      bound by the type signature for
                 parse' :: (MonadPlus m, Parseable a) => T.Text -> m a
      at src/Frames/ColumnTypeable.hs:34:11-53
    Possible fix:
      add (Functor m) to the context of
        the type signature for
          parse' :: (MonadPlus m, Parseable a) => T.Text -> m a
    In the first argument of ‘(.)’, namely ‘fmap discardConfidence’
    In the expression: fmap discardConfidence . parse
    In an equation for ‘parse'’:
        parse' = fmap discardConfidence . parse
xcabal: Error: some packages failed to install:
Frames-0.1.4 failed during the building phase. The exception was:
ExitFailure 1

filtering an in memory frame

The tutorial mentions filtering but that relies on the pipes library which obviously only makes sense when using the streaming version using producers. I couldn't figure out a way to filter an in memory frame. I am sure I am missing something, but I am not quite sure what.

acowley / frames Goto Github PK

frames's Introduction

Frames

Data Frames for Haskell

Use Cases

Clean Data

Missing Header Row

Missing Data

Tutorial

Demos

Contribute

Benchmarks

frames's People

Contributors

Stargazers

Watchers

Forkers

frames's Issues

Code

expected type

actual type after attempting (but probably failing to ensure) column types aren't created if existing

CSV

Code

expected type

actual type

CSV

Haskell

Actual Type

Expected Type

Recommend Projects

Recommend Topics

Recommend Org