dh-core's Introduction

DataHaskell/dh-core

DataHaskell core project monorepo

Aims

This project aims to provide a native, end-to-end data science toolkit in Haskell. To achieve this, many kinds of expertise are valuable: engineers, scientists, programmers, visualization experts, and data journalists are all welcome to join the discussions and contribute. Not only should this be a working piece of software, it should also be intuitive and pleasant to use. All contributions, big or small, are very welcome and will be acknowledged.

Architecture

A single repository allows us to experiment with interfaces and move code around much more freely than many single-purpose repositories would. It also makes it more convenient to track and visualize progress.

This is the directory structure of the project; the main project lives in the dh-core subdirectory:

dh-core/
  dh-core/              
  dh-core-accelerate/
  ....

Contributed packages

A number of authors and maintainers agreed to move ownership of their repositories under the dh-core umbrella. In some cases, these packages were already published on Hackage and cannot simply disappear from there, nor can this new line of development break downstream packages.

For this reason, contributed packages will appear as subdirectories of the main dh-core project, and will need to retain their original .cabal file.

The stack tool can take care of multi-package projects: the packages stanza in the stack.yaml file defaults to just the project's own directory, but it can contain a list of paths to other Cabal projects; in our case it could look like:

packages:
- .
- analyze/
- datasets/

Packages that are already listed on Hackage must be added here as distinct sub-directories. Once the migration is complete (PRs merged etc.), add the project to this table:

Package              | Description                                   | Original author(s)                 | First version after merge
---------------------|-----------------------------------------------|------------------------------------|--------------------------
analyze              | Data analysis and manipulation library        | Eric Conlon                        | 0.2.0
datasets             | A collection of ready-to-use datasets         | Tom Nielsen                        | 0.2.6
dense-linear-algebra | Fast, native dense linear algebra primitives  | Bryan O'Sullivan, Alexey Khudyakov | 0.1.0 (a)

(a): to be updated

NB: Remember to bump version numbers and change web links accordingly when moving in contributed packages.

Contributing

  1. Open an issue (https://github.com/DataHaskell/dh-core/issues) with a description of what you want to work on (if it's not already open)
  2. Assign or add yourself to the issue contributors
  3. Pull from dh-core:master, start a git branch, add code
  4. Add tests
  5. Update the changelog, describing briefly your changes and their possible effects
  • If you're working on a contributed package (see the Contributed packages section above), increase the version number in the Cabal file accordingly

  • If you bumped version numbers, make sure these are updated accordingly in the Travis CI .yaml file

  6. Send a pull request with your branch, referencing the issue
  7. dh-core admins: merge only after another admin has reviewed and approved the PR

GHC and Stackage compatibility

Tested against:

  • Stackage nightly-2019-02-27 (GHC 8.6.3)

Development information and guidelines

Dependencies

We use the stack build tool.

Some systems might need binaries and headers for these additional libraries:

  • zlib
  • curl

(however if you're unsure, first try building with your current configuration).

Nix users should set nix.enable to true in the dh-core/dh-core/stack.yaml file.

Building instructions

In the dh-core/dh-core subdirectory, run

$ stack build

and this will re-build the main project and the contributed packages.

While developing, this stack command can come in handy: it will trigger a rebuild and run the tests every time a file in the project is modified:

$ stack build --test --ghc-options -Wall --file-watch

Testing

Example:

$ stack test core:doctest core:spec

The <project>:<test_suite> pairs determine which tests will be run.

Continuous Integration (TravisCI)

Travis builds dh-core and its hosted projects every time a commit is pushed to GitHub. Currently the dh-core/.travis.yml script uses the following command to install the GHC compiler, build the project and subprojects with stack, run the tests, and build the Haddock documentation:

- stack $ARGS --no-terminal --install-ghc test core:spec core:doctest dense-linear-algebra:spec --haddock

Visualizing the dependency tree of a package

stack can emit the dependency graph of a Haskell project in .dot format, which can then be rendered by the dot tool (from the graphviz suite). For example, in the following command the output of stack dot is piped into dot, which produces an SVG file called deps.svg:

stack dot --external --no-include-base --prune rts,ghc-prim,ghc-boot-th,template-haskell,transformers,containers,deepseq,bytestring,time,primitive,vector,text,hashable | dot -Tsvg > deps.svg

dh-core's People

Contributors

adlucem, arvindd, bos, kaizhang, lehins, lunaticare, magalame, mjarosie, mmesch, nandaleite, ocramz, raduom, shimuuar, stites, unkdeve

dh-core's Issues

Cross validation layer

Looking over the Dataloader code, I immediately thought about integrating a private dataset to play with some Haskell code. This made me wonder if anyone has thought about adding a cross-validation layer on top of it. For some canonical datasets there are predefined splits (test, train, validation), but for others one would need to define these.

It would be nice if there were some code that could partition given data according to k-folds or leave-p-out. In the case of time-series datasets, you'd have to make sure that the partitions respect the temporal ordering.
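
As a concrete starting point, here is a minimal, hypothetical sketch of k-fold index partitioning (leave-p-out would be analogous); the function name and types are illustrative only and not part of any existing dh-core API. Using contiguous folds also keeps the temporal ordering intact for time-series data:

-- | Split a list of examples into @k@ (train, validation) pairs; fold @i@
-- uses the @i@-th contiguous chunk as the validation set. Note: this is a
-- sketch; the last folds may be smaller when @k@ does not divide the length.
kFolds :: Int -> [a] -> [([a], [a])]
kFolds k xs = [ split i | i <- [0 .. k - 1] ]
  where
    n        = length xs
    foldSize = (n + k - 1) `div` k           -- ceiling division
    split i  =
      let (before, rest)   = splitAt (i * foldSize) xs
          (valFold, after) = splitAt foldSize rest
      in (before ++ after, valFold)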

BLAS layer

Unify dense and sparse linear algebra, for a given underlying vector type, under a single interface.

Blocked by #1 and #3
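
To make the discussion concrete, below is a hypothetical sketch of what such a unifying interface could look like; the class name and methods are only illustrative, not an agreed-upon design:

{-# LANGUAGE MultiParamTypeClasses #-}

-- | A dense or sparse matrix type @m@ acting on a shared vector type @v@.
class LinearOperator m v where
  nRows :: m -> Int
  nCols :: m -> Int
  -- | Matrix-vector product; dense and sparse instances would differ mostly here.
  apply :: m -> v -> v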

dense-linear-algebra : Add chronos-bench benchmarks

Is your feature request related to a problem? Please describe.
We cannot have criterion benchmarks for dense-linear-algebra, since there is this dependency chain:

criterion -> statistics -> dense-linear-algebra

chronos-bench doesn't depend on dense-linear-algebra ^_^

https://hackage.haskell.org/package/chronos-bench

Describe the solution you'd like
Have some performance benchmarks

Describe alternatives you've considered
There might be alternative benchmarking packages not based on dense-linear-algebra

dense-linear-algebra: Add support for SIMD instructions

SIMD instructions seem to be of great importance for the performance of a linear algebra library. The big question, then, is how to incorporate them into the rest of the library.

I've had some success with a fork of simd: https://github.com/Magalame/simd

Another question to solve would be how to:

Add dataloader for large datasets

I'm looking to write a data loader which gets image datasets from disk in the same way PyTorch's DataLoader class does. It should have the option to load images in batches. Originally this was going to go into hasktorch, but I think it might be better served in datasets -- what do you think? This could be done in isolation from most of datasets, but one small wrinkle is that the code could be written to also fetch datasets like CIFAR-10 or MNIST (which I would assume is ideal) -- in that case there might be some overlap with getFileFromSource and some refactoring might be nice (like multithreaded downloads).

Does this sound like a good contribution?
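
For discussion, here is a minimal, hypothetical sketch of the batching part (the names batchesOf and loadBatches are made up and not part of the current datasets API); the image decoder is left as a user-supplied function:

import Control.Monad (forM)
import System.Directory (listDirectory)
import System.FilePath ((</>))

-- | Group a list into batches of the given size.
batchesOf :: Int -> [a] -> [[a]]
batchesOf _ [] = []
batchesOf n xs = let (b, rest) = splitAt n xs in b : batchesOf n rest

-- | Load the files of an image folder in batches, applying a user-supplied
-- decoder (e.g. a JuicyPixels reader) to each file.
loadBatches :: Int -> (FilePath -> IO img) -> FilePath -> IO [[img]]
loadBatches batchSize decode dir = do
  files <- listDirectory dir
  forM (batchesOf batchSize files) $ mapM (decode . (dir </>))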

Floating point and approximate comparison

In #14 I raised the point that we need a set of type classes for properties of numbers and for approximate comparison.

So what I think we need is a type class for things like machine epsilon, the maximal and minimal representable numbers, transfinite numbers, NaN handling, etc., and another type class for approximate comparison of values. The design space here is rather large, so it would be good to collect the current state of the art and implementations in other languages and libraries.
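
As a starting point for that discussion, here is a rough, hypothetical sketch of the two classes; the names and methods are only meant to illustrate the shape of the design, not to fix it:

-- | Properties of a concrete machine representation of numbers.
class Fractional a => MachineNumber a where
  machineEpsilon   :: a
  largestFinite    :: a
  smallestPositive :: a
  isNaNValue       :: a -> Bool

-- | Approximate comparison, parameterized by an absolute tolerance.
class ApproxEq a where
  approxEq :: a        -- ^ tolerance
           -> a -> a -> Bool

instance ApproxEq Double where
  approxEq tol x y = abs (x - y) <= tol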

datasets : lint with brittany

datasets is very inconsistent, with lots of extra whitespace which causes terrible diffs. I think it, as well as dh-core, needs linting rules for consistency when developing -- also possibly GitHub hooks to reject PRs that don't adhere.

brittany is the current standard for haskell-ide-engine, is very flexible, and has a style I'm familiar with -- so that would be my vote. If anyone has alternatives, I think they should mention them here.

Linting the codebase basically requires everyone to sync up on branches. Luckily there are only four forks; we should try to sync up here to minimize the number of rebases that will be required after a linting commit.

datasets : split off datasets-core

Medium-to-long term: the loading/parsing machinery is growing in size and scope (see #22, #29), so those functions and types could be gathered in a separate datasets-core package. datasets will import it and add the actual datasets. Any ideas?

Bump Stackage to latest Nightly

Currently we build against Stackage LTS 11.22 but some dependencies (e.g. req) changed in a non-backward compatible way.
Fix: upgrade to Stackage nightly for now until the next LTS comes out.

datasets : harmonize Netflix parsers with the rest

The Netflix Prize dataset uses a custom parser because one data example does not fit into a single dataset row (as CSV data does) but has a custom "stanza-based" format. For example, these are two stanzas of the "qualifying.txt" data file:

1:
1046323,2005-12-19
1080030,2005-12-23
2127527,2005-12-04
1944918,2005-10-05
1057066,2005-11-07
954049,2005-12-20
10:
12868,2004-10-19
627923,2005-12-16
690763,2005-12-13

It would be nice to upgrade the library so that it can deal with these cases.

Solution sketch:

  • Add one constructor to ReadAs that can accept an attoparsec parser as a parameter; a sketch of what this could look like follows
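
Sketch only, assuming the existing ReadAs constructors stay as they are (elided below); the NetflixEntry record and the stanza parser are purely illustrative:

import Data.Attoparsec.ByteString.Char8
import Data.ByteString (ByteString)

data ReadAs a
  = {- ... existing constructors ... -}
    Parsable (Parser [a])   -- proposed: delegate to an arbitrary attoparsec parser

-- | One "stanza": a movie id line followed by its (customer, date) lines.
data NetflixEntry = NetflixEntry
  { movieId :: Int
  , custId  :: Int
  , date    :: ByteString
  } deriving Show

stanza :: Parser [NetflixEntry]
stanza = do
  mid <- decimal <* char ':' <* endOfLine
  many1 $ do
    cid <- decimal <* char ','
    d   <- takeTill (\ch -> ch == '\n' || ch == '\r') <* endOfLine
    pure (NetflixEntry mid cid d)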

analyze: add usage example(s)

Possibly a binary in the app/ folder with an end-to-end workflow. Then we can split back anything good that comes out of this into the main library

Implement DoubleDouble

It's a way of emulating a not-quite-quad-precision number using two doubles. The algorithm is interesting by itself and could have a few uses, but I think its main value is in providing an example of a constant-size approximation of real numbers which isn't IEEE 754. It would be very useful for implementing type classes for working with low-level representations of numbers; without such examples it's all too easy to assume that only single- and double-precision IEEE 754 numbers exist.

A Julia implementation and references can be found here: https://github.com/JuliaMath/DoubleDouble.jl
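
For reference, a minimal sketch of the error-free building block (Knuth's twoSum) that double-double arithmetic is built on; a real DoubleDouble type would add renormalization, multiplication, and so on:

-- | An unevaluated sum @hi + lo@ of two doubles, where @lo@ carries the part
-- that does not fit into @hi@.
data DoubleDouble = DD !Double !Double
  deriving Show

-- | Error-free sum: returns (s, e) with s = fl(a + b) and a + b = s + e exactly.
twoSum :: Double -> Double -> (Double, Double)
twoSum a b =
  let s  = a + b
      b' = s - a
      a' = s - b'
      e  = (a - a') + (b - b')
  in (s, e)

-- | Add a plain Double to a DoubleDouble (without the usual renormalization step).
addDouble :: DoubleDouble -> Double -> DoubleDouble
addDouble (DD hi lo) x =
  let (s, e) = twoSum hi x
  in DD s (lo + e)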

datasets: add unit tests

Some unit tests asserting e.g. the length or some other property of the datasets would be nice to have.
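
For example, a first test could be something along these lines (a sketch only; it assumes an embedded dataset such as Numeric.Datasets.Iris.iris is available as a plain list):

import Test.Hspec
import Numeric.Datasets.Iris (iris)

main :: IO ()
main = hspec $
  describe "datasets" $
    it "iris has the expected number of rows" $
      length iris `shouldBe` 150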

Algorithms : Classification : Decision trees

I've started adding some code from my decision-trees project under the Core.Numeric.Statistics and Core.Data namespaces. There is some machinery that could be re-used (for example the Dataset abstraction for labeled data and some information theory functionals).

See 6bba752
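
As an illustration of the kind of information-theory functional mentioned above, here is a self-contained sketch of Shannon entropy over a list of class labels (not the exact code from the commit):

import qualified Data.Map.Strict as M

-- | Shannon entropy (in bits) of the empirical label distribution.
entropy :: Ord label => [label] -> Double
entropy labels = negate . sum $ map term (M.elems counts)
  where
    counts = M.fromListWith (+) [ (l, 1 :: Int) | l <- labels ]
    n      = fromIntegral (length labels)
    term c = let p = fromIntegral c / n in p * logBase 2 p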

Cut new release

  • test dh-core as a whole with latest changes
  • releases :
    • datasets
    • ?

analyze: reduce transitive dependencies

The set of transitive dependencies of analyze is currently quite large:

base-compat-0.9.3: build
base-orphans-0.6: build
dlist-0.8.0.3: build
cabal-doctest-1.0.2: build
integer-logarithms-1.0.2: build
mtl-2.2.1: build
primitive-0.6.2.0: build
random-1.1: build
semigroups-0.18.3: build
stm-2.4.4.1: build
text-1.2.2.2: build
time-locale-compat-0.1.1.3: build
StateVar-1.1.0.4: build
transformers-compat-0.5.1.4: build
vector-0.12.0.1: build
void-0.7.2: build
exceptions-0.8.3: build
contravariant-1.4: build
mmorph-1.0.9: build
tagged-0.8.5: build
distributive-0.5.2: build
comonad-5.0.1: build
bifunctors-5.4.2: build
profunctors-5.2: build
semigroupoids-5.2: build
free-4.12.4: build
blaze-builder-0.4.0.2: build
hashable-1.2.6.1: build
scientific-0.3.5.1: build
unordered-containers-0.2.8.0: build
attoparsec-0.13.1.0: build
uuid-types-1.0.3: build
lucid-2.9.8.1: build
vector-th-unbox-0.2.1.6: build
math-functions-0.2.1.0: build
mwc-random-0.13.6.0: build
cassava-0.4.5.1: build
aeson-1.1.2.0: build
foldl-1.2.5: build

For example I would like to understand whether free (which brings in a few dependencies) is really necessary or can be removed, in favour of a simpler (if ad-hoc) solution.

analyze: test failure

    [14 of 14] Compiling Main             ( test/Spec.hs, .stack-work/dist/x86_64-osx/Cabal-2.2.0.1/build/spec/spec-tmp/Main.o )
    
   dh-core/analyze/test/Spec.hs:30:14: error:
        • Ambiguous type variable ‘e0’ arising from a use of ‘catch’
          prevents the constraint ‘(Exception e0)’ from being solved.
          Probable fix: use a type annotation to specify what ‘e0’ should be.
          These potential instances exist:
            instance Exception SomeException -- Defined in ‘GHC.Exception’
            instance Exception A.ColSizeMismatch
              -- Defined at src/Analyze/Common.hs:37:10
            instance (Show k,
                      base-4.11.1.0:Data.Typeable.Internal.Typeable k) =>
                     Exception (A.DuplicateKeyError k)
              -- Defined at src/Analyze/Common.hs:33:10
            ...plus five others
            ...plus 17 instances involving out-of-scope types
            (use -fprint-potential-instances to see them all)
        • In the expression: catch (action >> return P.succeeded) handler
          In an equation for ‘tester’:
              tester = catch (action >> return P.succeeded) handler
          In an equation for ‘propertyIO’:
              propertyIO action
                = ioProperty tester
                where
                    tester :: IO P.Result
                    tester = catch (action >> return P.succeeded) handler
                    handler (HUnitFailure err) = return P.failed {P.reason = err}
       |
    30 |     tester = catch (action >> return P.succeeded) handler
       |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    
    dh-core/analyze/test/Spec.hs:31:14: error:
        • The constructor ‘HUnitFailure’ should have 2 arguments, but has been given 1
        • In the pattern: HUnitFailure err
          In an equation for ‘handler’:
              handler (HUnitFailure err) = return P.failed {P.reason = err}
          In an equation for ‘propertyIO’:
              propertyIO action
                = ioProperty tester
                where
                    tester :: IO P.Result
                    tester = catch (action >> return P.succeeded) handler
                    handler (HUnitFailure err) = return P.failed {P.reason = err}
       |
    31 |     handler (HUnitFailure err) = return P.failed { P.reason = err }
       |              ^^^^^^^^^^^^^^^^
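
A possible fix, sketched under the assumption that the HUnit in scope is >= 1.5 (where HUnitFailure carries a source location and a FailureReason): give the handler a concrete type to resolve the ambiguity, and match both constructor arguments.

import Control.Exception (catch)
import Test.HUnit.Lang (HUnitFailure (..), formatFailureReason)
import Test.QuickCheck (Property, ioProperty)
import qualified Test.QuickCheck.Property as P

propertyIO :: IO () -> Property
propertyIO action = ioProperty tester
  where
    tester :: IO P.Result
    tester = catch (action >> return P.succeeded) handler
    -- A concrete handler type fixes the "ambiguous type variable" error, and
    -- the two-argument pattern matches the newer HUnitFailure constructor.
    handler :: HUnitFailure -> IO P.Result
    handler (HUnitFailure _loc reason) =
      return P.failed { P.reason = formatFailureReason reason }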

Cannot build project on macOS 10.14.3

The project fails to build on macOS 10.14.3 (Mojave).
The compiler complains it cannot find headers and binaries for the zlib and curl libraries.

To Reproduce
Steps to reproduce the behavior:

  1. Run
git clone git@github.com:DataHaskell/dh-core.git
cd dh-core
git checkout a2ad2552e8525acf0ace12069d29f333d1793f05
cd dh-core
stack build --no-nix
  2. See error in log. Cannot reproduce it right now after calling stack clean, probably because the built library is cached somewhere. Will try to reproduce the error with a Travis build.

Workarounds

  1. Use libraries from Homebrew
brew install curl zlib
stack build \
    --extra-include-dirs=/usr/local/opt/curl/include --extra-lib-dirs=/usr/local/opt/curl/lib \
    --extra-include-dirs=/usr/local/opt/zlib/include --extra-lib-dirs=/usr/local/opt/zlib/lib
  2. Use Nix to set up a proper build environment. Add the following lines to dh-core/stack.yaml:
nix:
  enable: true
  packages: 
  - curl
  - zlib

and run stack build

Environment

  • OS: macOS 10.14.3 (Mojave)
  • Stack Version 1.9.3 x86_64 hpack-0.31.1

add (lower) dependency bounds

Users who don't use stack might have a hard time building this project, so (lower) version bounds should be added to all contributed packages and to dh-core itself.

dense-linear-algebra : Weird memory and runtime behavior from `generateSym`

The generateSym function is defined as:

generateSym :: Int -> (Int -> Int -> Double) -> Matrix
generateSym n f = runST $ do
  m <- unsafeNew n n
  for 0 n $ \r -> do
    unsafeWrite m r r (f r r)
    for (r+1) n $ \c -> do
      let x = f r c
      unsafeWrite m r c x
      unsafeWrite m c r x
  unsafeFreeze m

Running it with n=100, we can note that the function allocates ~160,000 bytes of memory, which is around twice what we would expect when allocating one Matrix (a 100 × 100 matrix of Doubles takes 100 × 100 × 8 = 80,000 bytes).
This allocation seems to be related to the dependence of x on c: if we change f r c to f r r, the allocation drops to ~80,000 bytes and the runtime is halved.

Add test coverage

Code coverage should be added to the Travis config (perhaps the cabal file and/or the stack options need to be changed in order to account for hpc coverage generation); currently in Travis there is only a project key.
This tool uploads hpc coverage reports to codecov.io.

BostonHousing data set URL needs to be updated.

Describe the bug
UCI ML Repository link http://mlr.cs.umass.edu/ml/datasets/housing is down, and requesting the BostonHousing dataset throws an exception:

*** Exception: VanillaHttpException (HttpExceptionRequest Request {
  host                 = "mlr.cs.umass.edu"
  port                 = 80
  secure               = False
  requestHeaders       = []
  path                 = "/ml/machine-learning-databases/housing/housing.data"
  queryString          = ""
  method               = "GET"
  proxy                = Nothing
  rawBody              = False
  redirectCount        = 10
  responseTimeout      = ResponseTimeoutDefault
  requestVersion       = HTTP/1.1
  proxySecureMode      = ProxySecureWithConnect
}
 (ConnectionFailure Network.Socket.getAddrInfo (called with preferred socket type/protocol: AddrInfo {addrFlags = [], addrFamily = AF_UNSPEC, addrSocketType = Stream, addrProtocol = 0, addrAddress = 0.0.0.0:0, addrCanonName = Nothing}, host name: Just "mlr.cs.umass.edu", service name: Just "80"): does not exist (nodename nor servname provided, or not known)))

To Reproduce
Steps to reproduce the behavior:

  1. On GHCi, type import Numeric.Datasets (getDataset)
  2. Type import Numeric.Datasets.BostonHousing (bostonHousing)
  3. Type bh <- getDataset bostonHousing
  4. See error

Expected behavior
Loads the Boston Housing dataset into memory as the object bh.


Desktop (please complete the following information):

  • OS: macOS
  • GHC version 8.10.4


Additional context
The line below needs to be updated to use uciMLDB:

csvDataset $ URL $ umassMLDB /: "housing" /: "housing.data"
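
Assuming uciMLDB points at the UCI repository root and composes paths with the same /: operator, the corrected line would presumably become:

csvDataset $ URL $ uciMLDB /: "housing" /: "housing.data"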

Reference: some other dataset URLs were corrected in #67.

datasets : add ARFF format

An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes.

https://www.cs.waikato.ac.nz/ml/weka/arff.html

Overview

ARFF files have two distinct sections. The first section is the Header information, which is followed by the Data information.

The Header of the ARFF file contains the name of the relation, a list of the attributes (the columns in the data), and their types. An example header on the standard IRIS dataset looks like this:

   % 1. Title: Iris Plants Database
   % 
   % 2. Sources:
   %      (a) Creator: R.A. Fisher
   %      (b) Donor: Michael Marshall (MARSHALL%[email protected])
   %      (c) Date: July, 1988
   % 
   @RELATION iris

   @ATTRIBUTE sepallength  NUMERIC
   @ATTRIBUTE sepalwidth   NUMERIC
   @ATTRIBUTE petallength  NUMERIC
   @ATTRIBUTE petalwidth   NUMERIC
   @ATTRIBUTE class        {Iris-setosa,Iris-versicolor,Iris-virginica}

The Data of the ARFF file looks like the following:

   @DATA
   5.1,3.5,1.4,0.2,Iris-setosa
   4.9,3.0,1.4,0.2,Iris-setosa
   4.7,3.2,1.3,0.2,Iris-setosa
   4.6,3.1,1.5,0.2,Iris-setosa
   5.0,3.6,1.4,0.2,Iris-setosa
   5.4,3.9,1.7,0.4,Iris-setosa
   4.6,3.4,1.4,0.3,Iris-setosa
   5.0,3.4,1.5,0.2,Iris-setosa
   4.4,2.9,1.4,0.2,Iris-setosa
   4.9,3.1,1.5,0.1,Iris-setosa

Lines that begin with a % are comments. The @RELATION, @ATTRIBUTE and @DATA declarations are case insensitive.
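
For illustration, here is a small attoparsec sketch of how an @ATTRIBUTE line could be parsed; the types and names below are not part of the datasets API and only cover NUMERIC and nominal attributes:

{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<|>))
import Data.Attoparsec.Text
import Data.Char (isSpace)
import qualified Data.Text as T

data AttrType
  = Numeric
  | Nominal [T.Text]          -- e.g. {Iris-setosa,Iris-versicolor,Iris-virginica}
  deriving Show

data Attribute = Attribute T.Text AttrType
  deriving Show

-- | Parse one "@ATTRIBUTE name type" line (declarations are case insensitive).
attribute :: Parser Attribute
attribute = do
  _    <- asciiCI "@attribute" <* skipSpace
  name <- takeTill isSpace <* skipSpace
  typ  <- (Numeric <$ asciiCI "numeric")
      <|> (Nominal <$> nominalValues)
  pure (Attribute name typ)
  where
    nominalValues =
      char '{' *> (T.strip <$> takeTill (inClass ",}")) `sepBy1` char ',' <* char '}'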

datasets : add exceptions

Currently, the parsers error and fail here and there. Since these are synchronous exceptions, it would be better to use MonadThrow, which can be conveniently used at a "pure" type such as Maybe or Either. A sketch follows the list below.

  1. add exceptions as dependency
  2. import Control.Monad.Catch (MonadThrow(..))
  3. declare some parsing exceptions type (which requires Typeable and Exception instances, see https://www.fpcomplete.com/blog/2016/11/exceptions-best-practices-haskell)
  4. convert the calls to error and fail into calls to throwM
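
A sketch of steps 3 and 4, with a hypothetical ParseError type and parseRow function (the real set of parsing exceptions would be richer):

import Control.Exception (Exception)
import Control.Monad.Catch (MonadThrow (..))
import Text.Read (readMaybe)

data ParseError
  = MalformedRow String    -- the offending row
  | MissingField String    -- the missing column name
  deriving Show

instance Exception ParseError

-- Before: parseRow r = error ("bad row: " ++ r)
-- After: callers can recover at Maybe, Either, IO, ...
parseRow :: MonadThrow m => String -> m [Double]
parseRow r = case traverse readMaybe (words r) of
  Just xs -> pure xs
  Nothing -> throwM (MalformedRow r)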

dense-linear-algebra: Getting stream fusion to work across `Matrix`'s

The current problem is as follows:

(U.sum . flip M.column 0) a does not fuse. It seems to boil down to:

testRewrite1 :: Matrix -> Double  --fuses
testRewrite1 (Matrix r c v) = U.sum . flip (\u j -> U.generate r (\i -> u `U.unsafeIndex` (j + i * c))) 0 $ v

testRewrite2 :: Matrix -> Double -- does NOT fuse
testRewrite2 m = U.sum . flip (\(Matrix r c v) j -> U.generate r (\i -> v `U.unsafeIndex` (j + i * c))) 0 $ m

Note: the flip isn't important, it's just there for convenience, since this comes from https://github.com/Magalame/fastest-matrices

So what seems to happen is that stream fusion cannot "go through" Matrix values; I'm not sure exactly why.

datasets: fix benchmark dataset folder

When running stack bench I get

bench: /Users/ocramz/.cache/datasets-hs/cifar-10-imagefolder/Truck: getDirectoryContents:openDirStream: does not exist (No such file or directory)

I guess it's a matter of copying the test data into a temporary directory before running these tests.
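
One possible approach, sketched with the temporary package (not currently a dependency); withBenchData is a made-up helper that copies the cached files into a scratch directory before handing it to the benchmark:

import System.Directory (copyFile, listDirectory)
import System.FilePath ((</>))
import System.IO.Temp (withSystemTempDirectory)

-- | Copy the (flat) contents of the cache directory into a temporary
-- directory and run an action against it.
withBenchData :: FilePath -> (FilePath -> IO a) -> IO a
withBenchData cacheDir run =
  withSystemTempDirectory "datasets-bench" $ \tmp -> do
    files <- listDirectory cacheDir
    mapM_ (\f -> copyFile (cacheDir </> f) (tmp </> f)) files
    run tmp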

analyze : evaluate `streaming` for RFrame

The RFrame type currently stores the frame entries as a Vector of Vectors (each inner vector being a data row). It would be nice to evaluate the performance of this way of storing with that of a streaming library (e.g. Stream (Of (Vector v)) m ()).
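
For comparison, a rough sketch of the two shapes (RFrameStream is hypothetical and uses the streaming package's types; the real RFrame also carries keys and lookup data, elided here):

import Data.Vector (Vector)
import Streaming (Of, Stream)
import qualified Streaming.Prelude as S

-- Current shape (roughly): all rows held in memory at once.
newtype RFrame v = RFrame (Vector (Vector v))

-- Possible alternative: rows produced on demand, one vector at a time.
newtype RFrameStream m v = RFrameStream (Stream (Of (Vector v)) m ())

-- | Example of a whole-frame fold that never needs the full frame in memory.
countRows :: Monad m => RFrameStream m v -> m Int
countRows (RFrameStream rows) = S.length_ rows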
