MLUtils.jl's Introduction

MLUtils.jl

MLUtils.jl defines interfaces and implements common utilities for Machine Learning pipelines.

Features

  • An extensible dataset interface (numobs and getobs); see the example after this list.
  • Data iteration and dataloaders (eachobs and DataLoader).
  • Lazy data views (obsview).
  • Resampling procedures (undersample and oversample).
  • Train/test splits (splitobs).
  • Data partitioning and aggregation tools (batch, unbatch, chunk, group_counts, group_indices).
  • Folds for cross-validation (kfolds, leavepout).
  • Lazy dataset transformations (mapobs, filterobs, groupobs, joinobs, shuffleobs).
  • Toy datasets for demonstration purposes.
  • Other data handling utilities (flatten, normalise, unsqueeze, stack, unstack).
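
A quick illustration of the dataset interface mentioned above (arrays are supported out of the box, with the last dimension indexing observations):

using MLUtils

X = rand(2, 5)       # a container with 5 observations of size 2
numobs(X)            # 5
getobs(X, 3)         # the third observation: a 2-element Vector
getobs(X, [1, 4])    # observations 1 and 4: a 2×2 Matrix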

Examples

Let us take a look at a hello world example to get a feeling for how to use this package in a typical ML scenario.

using MLUtils

# X is a matrix of floats
# Y is a vector of strings
X, Y = load_iris()

# The iris dataset is ordered according to its labels,
# which means that we should shuffle the dataset before
# partitioning it into training and test sets.
Xs, Ys = shuffleobs((X, Y))

# We leave out 15% of the data for testing
cv_data, test_data = splitobs((Xs, Ys); at=0.85)

# Next we partition the data using a 10-fold scheme.
for (train_data, val_data) in kfolds(cv_data; k=10)

    # We apply a lazy transform for data augmentation
    train_data = mapobs(xy -> (xy[1] .+ 0.1 .* randn.(), xy[2]),  train_data)

    for epoch = 1:10
        # Iterate over the data using mini-batches of 5 observations each
        for (x, y) in eachobs(train_data, batchsize=5)
            # ... train supervised model on minibatches here
        end
    end
end

In the above code snippet, the inner eachobs loop is the only place where data other than indices is actually copied. In fact, while x and y are materialized arrays, everything else is a data view.
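
As a quick check (a sketch; the exact view types are an implementation detail), we can verify that shuffling returns lazy views rather than copies, and that observations are materialized only by getobs:

using MLUtils

X, Y = load_iris()
Xs, Ys = shuffleobs((X, Y))

Xs isa SubArray     # true: Xs is a view into X, nothing was copied
getobs(Xs, 1)       # materializes the first observation as a plain Vector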

Related Packages

MLUtils.jl brings together functionality previously found in LearnBase.jl, MLDataPattern.jl, and MLLabelUtils.jl. These packages are now discontinued.

Other features were ported from the deep learning library Flux.jl, as they are of general use.

If you are looking for a scikit-learn replacement, MLJ.jl is a more complete package for managing the whole machine learning pipeline.

MLUtils.jl's People

Contributors

ancapdev, carlolucibello, darsnack, gabrevaya, github-actions[bot], karthik-d-k, lorenzoh, mcabbott, neroblackstone, pangoraw, romeov, saransh-cpp, theabhirath, touchesir, vaclavmacha

MLUtils.jl's Issues

broken inferred tests for `stack` and `batch`

julia> using MLUtils, Test

julia> x = [[1,2,3], [4,5,6]]
2-element Vector{Vector{Int64}}:
 [1, 2, 3]
 [4, 5, 6]

julia> @inferred batch(x)
ERROR: return type Matrix{Int64} does not match inferred return type Any
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] top-level scope
   @ REPL[4]:1

julia> @inferred stack(x; dims=1)
ERROR: return type Matrix{Int64} does not match inferred return type Any
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] top-level scope
   @ REPL[5]:1

The problem could be due to a broadcasted call to unsqueeze, on which stack relies:

julia> @inferred unsqueeze(x[1]; dims=1)
1×3 Matrix{Int64}:
 1  2  3

julia> f() = unsqueeze.(x; dims=1)
f (generic function with 1 method)

julia> @inferred f()
ERROR: return type Vector{Matrix{Int64}} does not match inferred return type Any
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] top-level scope
   @ REPL[13]:1

Implementing `rand_like`

I was working on an implementation of rand_like, similar to the one in PyTorch, and I was unsure about how to handle RNGs. Right now, MLUtils does not depend on CUDA, so handling RNGs for CuArrays will be a problem because CUDA.default_rng() will not be accessible. Is there a way to write this so that the function still works for both CPU and GPU without introducing a CUDA.jl dependency?
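
One possible approach (a sketch, not an MLUtils implementation): build the output with similar and fill it with Random.rand!, so the RNG choice is delegated to whatever package owns the array type. CUDA.jl overloads rand! for CuArray, so no explicit CUDA dependency would be needed here:

using Random

# Hypothetical sketch of rand_like: allocate an array like `x` and fill it
# with uniform random numbers. For a CuArray, CUDA.jl's own rand! method
# (and hence its RNG) is dispatched to automatically.
rand_like(x::AbstractArray, ::Type{T} = eltype(x)) where {T} = rand!(similar(x, T))
rand_like(rng::AbstractRNG, x::AbstractArray, ::Type{T} = eltype(x)) where {T} =
    rand!(rng, similar(x, T))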

Scope of this package

This package is kickstarting the plan outlined in JuliaML/LearnBase.jl#49

  • For the moment we can add both definitions and implementations here; at some point we will move the basic definitions to a LearnAPI.jl package (to be created).
  • We can gradually move the functionality from MLLabelUtils and MLDataPattern here. We mainly want to add the functionality that is actually in use in the ecosystem (e.g. FastAI.jl and MLJ.jl) and leave the rest out to avoid extra maintenance complexity.

@darsnack @johnnychen94

define `numobs` and `getobs` fallbacks for Tables.jl's tables?

We could implement generic fallbacks like:

function numobs(x) 
  if Tables.istable(x)
    length(Tables.rows(x))
  else
    error("numobs not defined")
  end
end

function getobs(x, i)
  if Tables.istable(x)
    Tables.rows(x)[i]
  else
    error("getobs not defined")
  end
end
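
With such fallbacks any row table would behave as a data container. A quick sketch of what the fallbacks would return, assuming standard Tables.jl semantics for a vector of NamedTuples:

using Tables

tbl = [(a = 1, b = 2.0), (a = 3, b = 4.0)]   # a vector of NamedTuples is a row table

Tables.istable(tbl)        # true
length(Tables.rows(tbl))   # 2, what the numobs fallback would return
Tables.rows(tbl)[2]        # (a = 3, b = 4.0), what the getobs fallback would return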

Lazy operations should be opt-in

Operations like splitobs, shuffleobs, and many more return ObsViews that one has to call getobs on in order to materialize.
I think this is unexpected for users coming from scikit-learn and mildly annoying in most scenarios.
By default, operations on materialized objects should return materialized objects (e.g. arrays and dataframes).
Users would be able to opt in to lazy behavior by wrapping data in an ObsView. Operations on an ObsView would produce other ObsViews that can be materialized only at the end of the pipeline.

PR welcome on stratified K-folds?

I can make a first-round PR and am then willing to make whatever changes are necessary, if anyone is willing to coach me through it and look at my PR. Thanks.

Padded batching

When working with ANNs that output different-sized vectors depending on the input (for example when using GraphNeuralNetworks.jl), it would be useful to be able to convert the output of batch to a CuArray in order to perform loss computations.
Current:

julia> MLUtils.batch([[1,2],[3,4]])
2×2 Matrix{Int64}:
 1  3
 2  4

julia> MLUtils.batch([[1,2],[3]])
ERROR: DimensionMismatch("mismatch in dimension 1 (expected 2 got 1)")

Feature:

julia> MLUtils.batch([[1,2],[3]], pad =  0)
2×2 Matrix{Int64}:
 1  3
 2  0
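
Until something like this exists, a user-side workaround could look like the following sketch (batch_with_padding is a hypothetical helper, not part of MLUtils):

using MLUtils

# Pad every vector to the length of the longest one, then batch as usual.
function batch_with_padding(xs::AbstractVector{<:AbstractVector}; value = 0)
    n = maximum(length, xs)
    padded = [vcat(x, fill(value, n - length(x))) for x in xs]
    return batch(padded)
end

batch_with_padding([[1, 2], [3]])   # 2×2 Matrix: [1 3; 2 0]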

circularity in AbstractDataContainer definitions

We currently have the following definitions for AbstractDataContainer where getindex falls back to getobs

abstract type AbstractDataContainer end

Base.getindex(x::AbstractDataContainer, i) = getobs(x, i)
Base.length(x::AbstractDataContainer) = numobs(x)
Base.size(x::AbstractDataContainer) = (length(x),)

Base.iterate(x::AbstractDataContainer, state = 1) =
    (state > length(x)) ? nothing : (x[state], state + 1)
Base.lastindex(x::AbstractDataContainer) = length(x)

and on the other end we have the generic fallback for getobs

getobs(x, i) = getindex(x, i)
numobs(x) = length(x)

I find this circularity a bit confusing and think it should be avoided. I suggest we change AbstractDataContainer to

abstract type AbstractDataContainer end

Base.iterate(x::AbstractDataContainer, state = 1) =
    (state > numobs(x)) ? nothing : (getobs(x, state), state + 1)
Base.lastindex(x::AbstractDataContainer) = numobs(x)

Then types inheriting from AbstractDataContainer:

  • Can implement getindex if they want both the "indexing" interface and the "observation" interface.
  • Can implement just getobs if for some reason they don't want to expose an indexing interface.
  • Can implement both getobs and getindex if the two interfaces serve different purposes (e.g. as with arrays).

As an addendum, let me remark that with the getobs(x, i) = getindex(x, i) fallback we are basically saying that we consider any type implementing getindex to be a dataset, which is something we should perhaps document better. Should defining getindex be the recommended way to define custom dataset types (even when not subtyping AbstractDataContainer)?
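
For illustration, a minimal custom container under this interface could look like the following sketch (Squares is a made-up example type):

using MLUtils

# A dataset whose i-th observation is i^2.
struct Squares <: MLUtils.AbstractDataContainer
    n::Int
end

MLUtils.numobs(d::Squares) = d.n
MLUtils.getobs(d::Squares, i) = i .^ 2

d = Squares(5)
numobs(d)     # 5
getobs(d, 3)  # 9
d[3]          # 9 under the current definitions, where getindex falls back to getobs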

@darsnack
related to JuliaML/MLDatasets.jl#96

Drop ObsView etc.

Should we drop DataView and ObsView? Most of the data iterators can be implemented without them, and I think we should minimize the number of extraneous (abstract) types.

Unnecessary allocation in getobs! for arrays

Hi, I think that the current implementation of getobs! for arrays is not optimal, since it allocates a new array.

function getobs!(buffer::AbstractArray, A::AbstractArray)
    buffer .= A
    return buffer
end

function getobs!(buffer::AbstractArray, A::AbstractArray{<:Any, N}, idx) where N
    I = ntuple(_ -> :, N-1)
    buffer .= A[I..., idx]
    return buffer
end

This can easily be fixed by using a view instead of normal indexing, but I think it's better to use the following implementation, which uses copyto!. However, I'm not entirely sure whether it's safe to use copyto! this way.

function getobs_new!(buffer::AbstractArray, A::AbstractArray)
    Base.setindex_shape_check(buffer, size(A)...)
    copyto!(buffer, A)
    return buffer
end

function getobs_new!(buffer::AbstractArray, A::AbstractArray{<:Any, N}, idx) where N
    I = ntuple(_ -> :, N-1)
    src = view(A, I..., idx)
    Base.setindex_shape_check(buffer, size(src)...)
    copyto!(buffer, src)
    return buffer
end

The main advantage is that CUDA.jl defines copyto! for CuArray, and therefore getobs_new! should work if the buffer is a CuArray and A is a standard array. However, copyto! for CuArray is currently only implemented for Array and not for SubArray, so getobs_new! with idx does not work properly. Below is a simple comparison of both implementations:

using MLUtils

shape = (256, 256, 3);
buffer1 = rand(shape..., 64);
buffer2 = rand(shape..., 64);
x = rand(shape..., 100);
idx = rand(1:100, 64);
julia> buffer1 == buffer2
false

julia> getobs!(buffer1, x, idx);

julia> getobs_new!(buffer2, x, idx);

julia> buffer1 == buffer2
true

Simple Benchmarks:

julia> using BenchmarkTools

julia> @benchmark getobs!($buffer1, $x, $idx)
BenchmarkTools.Trial: 85 samples with 1 evaluation.
 Range (min … max):  52.979 ms … 71.178 ms  ┊ GC (min … max): 0.00% … 19.13%
 Time  (median):     55.086 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.376 ms ±  6.919 ms  ┊ GC (mean ± σ):  7.73% ±  9.42%

    █  ▇                                              ▃        
  ▆█████▇▇▆▅▁▅▃▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▅▆▅▇██▁▃▁▃▃ ▁
  53 ms           Histogram: frequency by time        71.2 ms <

 Memory estimate: 96.00 MiB, allocs estimate: 3.

julia> @benchmark getobs_new!($buffer1, $x, $idx)
BenchmarkTools.Trial: 128 samples with 1 evaluation.
 Range (min … max):  37.904 ms … 43.696 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     38.963 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   39.047 ms ± 661.120 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▂     ██ ▂▄ ▆   ▂   ▆▆ ▂ ▂                        
  ▄▁▄▄▁▆▄▆█▄▆▄██▆▁▆██████▆██▆▆███▆████▄█▄█▆█▆▁▄▄▄▁▄▁▆▁▄▁▁▄▁▁▄▄ ▄
  37.9 ms         Histogram: frequency by time         40.4 ms <

 Memory estimate: 48 bytes, allocs estimate: 1.

Usage of CUDA

julia> using CUDA

julia> getobs!(CuArray(buffer1), x[:,:,:,idx]);
ERROR: GPU compilation of kernel #broadcast_kernel#17(CUDA.CuKernelContext, CuDeviceArray{Float64, 4, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, NTuple{4, Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{Array{Float64, 4}, NTuple{4, Bool}, NTuple{4, Int64}}}}, Int64) failed
KernelError: passing and using non-bitstype argument

Argument 4 to your kernel function is of type Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, NTuple{4, Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{Array{Float64, 4}, NTuple{4, Bool}, NTuple{4, Int64}}}}, which is not isbits:
  .args is of type Tuple{Base.Broadcast.Extruded{Array{Float64, 4}, NTuple{4, Bool}, NTuple{4, Int64}}} which is not isbits.
    .1 is of type Base.Broadcast.Extruded{Array{Float64, 4}, NTuple{4, Bool}, NTuple{4, Int64}} which is not isbits.
      .x is of type Array{Float64, 4} which is not isbits.

julia> getobs_new!(CuArray(buffer2), x, 1:64);
┌ Warning: Performing scalar indexing on task Task (runnable) @0x00007fa662080010.
│ Invocation of setindex! resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArrays ~/.julia/packages/GPUArrays/Zecv7/src/host/indexing.jl:56
      
julia> buffer_cu = getobs_new!(CuArray(buffer2), x[:,:,:,1:64]);

julia> Array(buffer_cu) == x[:,:,:,1:64]
true

mapobs problems with composite datasets and vector indexes

With tuples:

julia> t = (zeros(3), ones(3))
([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])

julia> m = mapobs(x -> (x[1], 2*x[2]), t)
mapobs(#15, Tuple{Vector{Float64}, Vector{Float64}})

julia> m[1]
(0.0, 2.0)

julia> m[2]
(0.0, 2.0)

julia> m[1:2]
((0.0, 0.0), (1.0, 2.0))  # expected ([0.0, 0.0], [2.0, 2.0]) 

With named tuples:

julia> t = (a=zeros(3),b=ones(3))
(a = [0.0, 0.0, 0.0], b = [1.0, 1.0, 1.0])

julia> m = mapobs(x -> (x[1], 2*x[2]), t)
mapobs(#17, NamedTuple{(:a, :b), Tuple{Vector{Float64}, Vector{Float64}}})

julia> m[1]
(0.0, 2.0)

julia> m[2]
(0.0, 2.0)

julia> m[1:2]
ERROR: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved
Stacktrace:
 [1] broadcastable(#unused#::NamedTuple{(:a, :b), Tuple{Vector{Float64}, Vector{Float64}}})
   @ Base.Broadcast ./broadcast.jl:705
 [2] broadcasted
   @ ./broadcast.jl:1295 [inlined]
 [3] getindex(data::MLUtils.MappedData{var"#17#18", NamedTuple{(:a, :b), Tuple{Vector{Float64}, Vector{Float64}}}}, idxs::UnitRange{Int64})
   @ MLUtils ~/.julia/packages/MLUtils/8OXl7/src/obstransform.jl:14
 [4] top-level scope
   @ REPL[35]:1

The fix could be turning

Base.getindex(data::MappedData, idxs::AbstractVector) = data.f.(getobs(data.data, idxs))

into

Base.getindex(data::MappedData, idxs::AbstractVector) = Flux.batch([data.f(getobs(data.data, i)) for i in idxs]) 

Port parallel loaders from DataLoaders.jl

This is the second part of porting functionality from DataLoaders.jl.

This one includes the parallel loaders:

  • GetObsParallel
  • BufferGetObsParallel

With this also comes the question of what the interface for buffered container views will look like. In general, the pattern is that one wants an iterator over the observations in a data container, with the 4 combinations arising from buffered/unbuffered and parallel/single-threaded.
Can we brainstorm a consistent interface here? The first thing that comes to mind is having a single eachobs function with keyword arguments, eachobs(data; buffered = false, parallel = false). This could then also give warnings when you pass parallel = true while Threads.nthreads() == 1, and so on.
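
For illustration, the keyword-based interface proposed above might be used like this (hypothetical, not an existing API):

X = rand(2, 100)   # any data container

# All four combinations through a single entry point (hypothetical keywords):
eachobs(X)                                    # single-threaded, no buffering
eachobs(X; buffered = true)                   # reuse a buffer via getobs!
eachobs(X; parallel = true)                   # multi-threaded loading
eachobs(X; buffered = true, parallel = true)  # both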

@darsnack @CarloLucibello

DataLoader: partial = false returns incorrect batch sizes (introduced in v0.2.7)

With MLUtils v0.2.7:

partial = false: incorrect (should be 2 batches with nobs=5)

julia> using Flux
       x = rand(4,12)
       dloader = Flux.Data.DataLoader(x, batchsize=5, shuffle=true, partial=false)
       for d in dloader
           println(size(d))
       end
(4, 5)
(4, 3)

partial = true: ✓

julia> using Flux
       x = rand(4,12)
       dloader = Flux.Data.DataLoader(x, batchsize=5, shuffle=true, partial=true)
       for d in dloader
           println(size(d))
       end
(4, 5)
(4, 5)
(4, 2)

With MLUtils v0.2.6:

partial = false: ✓

julia> using Flux
       x = rand(4,12)
       dloader = Flux.Data.DataLoader(x, batchsize=5, shuffle=true, partial=false)
       for d in dloader
           println(size(d))
       end
(4, 5)
(4, 5)

partial = true: ✓

julia> using Flux
       x = rand(4,12)
       dloader = Flux.Data.DataLoader(x, batchsize=5, shuffle=true, partial=true)
       for d in dloader
           println(size(d))
       end
(4, 5)
(4, 5)
(4, 2)

`chunk` returns views for CuArrays

See the discussion in FluxML/Metalhead.jl#165
and JuliaGPU/CUDA.jl#1542.

The problem is that we don't want wrapped CuArray types, since they will affect subsequent dispatches, right @theabhirath @darsnack?

using CUDA, MLUtils

julia> x = rand(2, 10) |> cu;

julia> a = chunk(x, 2)[1]  # wrapped cuarray
2×2 view(::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, :, 1:2) with eltype Float32:
 0.545259  0.936495
 0.174116  0.514381

julia> view(x, :, 1:2) # just a cuarray
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.545259  0.936495
 0.174116  0.514381

Shuffling a BatchView gives an ObsView of the data

The following gives a shuffled ObsView of the underlying data, instead of shuffling the underlying data while maintaining the batch view:

using MLUtils
x = rand(100)
x_train, x_val = splitobs(x; at=0.7)
dl = BatchView(x_train; batchsize=10)
dl = shuffleobs(dl)

Gives:

julia> shuffleobs(dl)
70-element view(::Vector{Float64}, [31, 32, 33, 34, 35, 36, 37, 38, 39, 40  …  61, 62, 63, 64, 65, 66, 67, 68, 69, 70]) with eltype Float64:
 0.863309839047254
 0.09280526616938278
 0.9567810179363109
 0.5684046270934635
....

Instead of something more akin to BatchView(shuffleobs(dl.data); batchsize=dl.batchsize, partial=dl.partial)

API for iterator/view variants

Re #33, what API do we want to go with for the parallel eachobs? I think having eachobs(...; parallel=false) would be cleanest.

Same question regarding the collated BatchView in #63: should we add a collate keyword to BatchView?

Extend function `chunk()` to accept a chunk size parameter?

Hi, I have a use case where I want to split an array x into parts in the same way as the function chunk() does, but with a known chunk size and an unknown number of parts. AFAICS there is no function similar to chunk() in MLUtils.jl that expects a chunk size parameter instead of the number of parts.

Of course, I could write a wrapper around chunk() that calculates the number of parts from the array size and the desired chunk size. But as can be seen in utils.jl, internally the number of parts is already converted to a chunk size. That means a wrapper around chunk() would convert the chunk size to a number of parts, and chunk() would then convert the number of parts back to a chunk size.

My question therefore is whether it would be meaningful to provide a function similar to chunk() in MLUtils.jl that expects a chunk size parameter. The current implementation of chunk() could then be rewritten as a wrapper around this new function, calculating the chunk size from the number of parts.
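
For reference, the wrapper mentioned above could be a one-liner (chunk_by_size is a hypothetical helper, not part of MLUtils):

using MLUtils

# Split a vector into chunks of at most `sz` elements by converting
# the chunk size into a number of parts for `chunk`.
chunk_by_size(x::AbstractVector, sz::Int) = chunk(x, ceil(Int, length(x) / sz))

chunk_by_size(collect(1:10), 4)   # 3 chunks: [1, 2, 3, 4], [5, 6, 7, 8], [9, 10]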

Can MLUtils play nicely with Tables.jl?

I think one could greatly increase buy-in for MLUtils.jl if every Tables.jl-compatible table automatically implemented the "data container" API. For performance one would still want to implement support for the concrete table types as well, but having it "just work" for all tables would be nice. I guess, since "table" is itself just an interface rather than an abstract type, this would need to be implemented as part of the data container API, right? As Tables.jl is very lightweight, I don't see that as a big issue (and I could probably find someone to help with the integration).

Even so, there seems to be a problem implementing the interface for certain tables. MLUtils.jl interprets tuples in a very specific way. For example, shuffleobs((x1, x2)) treats x1 and x2 as separate data containers, which are to be shuffled simultaneously with the same base observation index shuffle. But some tables are tuples. The following example is even a tuple-table whose elements are themselves tables (of a different type):

julia> X
((a = [1, 3], b = [2, 3]), (a = [2, 5], b = [4, 7]))

julia> Tables.istable(X)
true

So is such a tuple a pair of data containers or a single data container? The current API cannot distinguish them.

I wonder:

  1. How attached are people to the current tuple-based dispatch for coupled multi-container processing?
  2. Is there a big use-case for tables that are also tuples? @quinnj

Possibly this discussion is related.

Tables that are tuples are problematic elsewhere.

@oxinabox @rikhuijzer @darsnack

conflict on function names with Flux.jl

WARNING: Method definition rpad(AbstractArray{T, 1} where T, Integer, Any) in module MLUtils at /home/admin/.julia/packages/MLUtils/OojOS/src/utils.jl:384 overwritten in module Flux at /home/admin/.julia/packages/Flux/7nTyc/src/utils.jl:610.
  ** incremental compilation may be fatally broken for this module **

add gradient tests

We should add gradient tests to functions that could be part of a trained model:

  • chunk #47
  • unsqueeze
  • stack, unstack

Port some data container functionality out of FastAI.jl

FastAI.jl currently has some data container functionality that I've found very useful. On the last ML ecosystem call, @darsnack, @ToucheSir, and I discussed that it makes sense to have some of that in MLUtils.jl. The relevant FastAI.jl code can be found here: transformations.jl

Specifically, there are some data container transformations that I believe should be ported:

  • mapobs(f, data), a lazy map over any data container. Generally useful.
  • groupobs(f, data) returns a Dict whose keys are the return values of f(obs) and whose values are data subsets of the observations that returned the same f(obs). Not sure if Dict is the right type here, but NamedTuple is too restrictive. Useful, for example, to create train/test splits based on some value in each observation.
  • filterobs(f, data) does what you'd expect, returning a data subset.
  • joinobs(datas...) treats multiple data containers as a single one. Open to a better name for this.

There are also some data container primitives for working with tables and files, but let's put that into another issue.
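
For a rough idea of the intended usage of the transformations listed above (a sketch based on the descriptions; exact return types may differ):

using MLUtils

data = 1:10

m = mapobs(x -> 2x, data)     # lazy map: getobs(m, 3) == 6
g = groupobs(iseven, data)    # Dict mapping false/true to the odd/even observations
f = filterobs(iseven, data)   # only the observations for which the predicate holds
j = joinobs(1:3, 4:6)         # the two containers behave as a single one with 6 observations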

`eachobs(;batchsize)` vs `BatchView(;batchsize)` vs `DataLoader(;batchsize)`

As the title suggests, I am wondering why there are three ways to iterate over batches:

using MLUtils
X = rand(4, 100)

it1 = eachobs(X, batchsize=10)
it2 = BatchView(X, batchsize=10)
it3 = DataLoader(X, batchsize=10)

for (x1, x2, x3) in zip(it1, it2, it3)
    @assert size(x1) == size(x2) == size(x3)
    @assert x1 == x2 == x3
end

Looking at the implementation, eachobs is implemented using BatchView, and DataLoader uses eachobs. So pardon my ignorant question, but why not have just one way of iterating over batches that provides all the features (shuffling, (partial) batching, etc.)?

Status of MLDataPattern porting

A list of what is currently exported from MLDataPattern.jl.

TO PORT

  • getobs, getobs! and nobs. #1
    • nobs is now numobs;
    • obsdim argument is dropped from the interface
  • randobs #1
  • datasubset, DataSubset #4
  • shuffleobs #5
  • splitobs #5
  • DataView #5
    • Consider removal
  • obsview, ObsView #5
    • Consider removal #8
  • batchview, BatchView #6
  • batchsize #6
  • slidingwindow, SlidingWindow
  • stratifiedobs
  • oversample, undersample #10
  • kfolds #9
  • leaveout #9
  • eachobs #9
  • eachbatch #9

NOT TO BE PORTED

  • BufferGetObs
  • RandomObs, RandomBatches
  • BalancedObs
  • FoldView
  • targets
  • eachtarget

Port collated batch view from DataLoaders.jl

This is the first part of porting functionality from DataLoaders.jl (ref #22).

This includes porting

  • BatchViewCollated from batchview.jl
    • Notes: the BatchDim machinery can probably be dropped if we can assume last dim = batch dim
  • collate from https://github.com/lorenzoh/DataLoaders.jl/blob/master/src/collate.jl

batchview.jl also includes an unexported batchsize helper. Do we want to have a proper batchsize?

Deterministic, parallel data iteration

The parallel eachobs implementation is not deterministic in that observations are returned as soon as they are loaded, so they may be returned out of order. This is very performant, and fine for some use cases like training, where data should be shuffled anyway.

Having the option of deterministic iteration would be helpful in many use cases, though.

This could be implemented as a wrapper around an existing iterator that does the following:

  • instead of iterating over data with the wrapped iterator, iterate over (1:nobs(data), data) to preserve ordering information
  • collect returned observations, stripping the index
  • return an observation only if all previous (by index) observations have been returned

I am unsure how much this will affect performance and memory usage, and how it interacts with buffersize. Are there alternative approaches to this implementation?
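
A minimal sketch of the reordering idea (not MLUtils code): buffer (index, observation) pairs that arrive out of order and emit an observation only once all earlier indices have been seen.

function reorder(indexed_obs)
    buffer = Dict{Int,Any}()
    next = 1
    out = Any[]
    for (i, obs) in indexed_obs
        buffer[i] = obs
        # flush everything that is now contiguous with the last emitted index
        while haskey(buffer, next)
            push!(out, pop!(buffer, next))
            next += 1
        end
    end
    return out
end

reorder([(2, "b"), (1, "a"), (3, "c")])   # ["a", "b", "c"]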

Tables.jl and DataAPI.jl interoperation

@ablaom I am not sure if this is the best place to start this discussion, but it is a follow up to https://discourse.julialang.org/t/random-access-to-rows-of-a-table/77386 and JuliaData/Tables.jl#278.

The key point is to avoid creating functions with essentially the same functionality across DataAPI.jl, Tables.jl, and MLUtils.jl (and possibly other ML packages I am not aware of).

Assume for a moment that a Tables.jl table is the source of data for some ML model, and that you want operations to be efficient.

My understanding is that your high-level workflow is the following:

  1. the user starts with a Tables.jl table.
  2. then the user performs observation subsetting, feature selection, and feature transformation operations on this table (either eagerly or lazily).
  3. finally the user transforms the result of step 2 (again, either lazily or eagerly) into a value that can be accepted as an input by the ML algorithm.

The question is:

What functionalities do you need to have in DataAPI.jl and Tables.jl so that this is efficient and you do not need to provide duplicate definitions of concepts in MLUtils.jl (or other packages)?
Another consideration (raised in the linked discussions) is that I would expect what we develop to be consistent with the interfaces that Base Julia already defines (e.g. the iterator interface, abstract vector interface, indexing interface, and view interface).

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Status of MLLabelUtils porting

TO PORT

  • labelmap. Now group_indices. #10
  • labelfreq. Now group_counts
  • labelmap2vec
  • ind2label, label2ind,
  • labeltype,
  • label,
  • nlabel,
  • poslabel, neglabel,
  • isposlabel, isneglabel,
  • classify, classify!,
  • convertlabel,
  • LabelEnc, labelenc, islabelenc,
  • convertlabelview

NOT TO PORT

DataLoader.nobs could make use of the `partial` flag to return the final number of samples being used?

I have recently started learning Flux for deep learning and came across this behavior of the DataLoader object.
If we create a DataLoader object as follows:

julia> dl = DataLoader(rand(Int8, 10, 64), batchsize=30, partial=true)
DataLoader{Matrix{Int8}, Random._GLOBAL_RNG}(Int8[-73 -49 … 65 57; 82 -99 … -125 -72; … ; -109 23 … 14 -68; -60 -90 … -121 70], 30, 64, true, false, Random._GLOBAL_RNG())

julia> dl.nobs
64

We get the total number of samples that will be used by the DataLoader as 64, which is correct.

But when we set partial=false, we get the same behavior as explained above, with dl.nobs set to the same value of 64.

My expectation in the latter scenario would be for dl.nobs to be set to 60, because we will be throwing away the last 4 samples (dropping the last mini-batch).

As I couldn't find the docs for dl.nobs, this is my current understanding; please correct me if I'm missing something obvious here.

And, if my understanding is correct, there are possibly 2 changes needed in the main/src/dataloader.jl file, as follows:

function DataLoader(data; batchsize=1, shuffle=false, partial=true, rng=GLOBAL_RNG)
    batchsize > 0 || throw(ArgumentError("Need positive batchsize"))
    nobs = numobs(data)
+   partial || (nobs -= nobs % batchsize)     # subtract last mini-batch samples when `partial=false`
    if nobs < batchsize
        @warn "Number of observations less than batchsize, decreasing the batchsize to $nobs"
        batchsize = nobs
    end
    DataLoader(data, batchsize, nobs, partial, shuffle, rng)
end
function Base.length(d::DataLoader)
    n = d.nobs / d.batchsize
-   d.partial ? ceil(Int, n) : floor(Int, n)  # removing this line as we would get correct `n`
end

I'm new to Julia, looking forward to learning and improving 😃🤞

Archiving DataLoaders.jl

Now that MLUtils.jl has feature parity with DataLoaders.DataLoader, it's time to deprecate DataLoaders.jl.

We should:

  • add a guide for users transitioning that explains the differences between DataLoaders.DataLoader and MLUtils.DataLoader.
  • port and adapt relevant documentation, including:
  • go through DataLoaders.jl issues to search for feature requests that we can move to MLUtils.jl, i.e. lorenzoh/DataLoaders.jl#32 -> #68

CI test in multithreaded environment

Since we have a multithreaded dataloader (named eachobsparallel for the time being), we should also test in a multithreaded environment in GitHub Actions CI.

related to #80

Status of MLDataUtils porting

TO PORT

  • noisy_sin -> now Datasets.make_sin #19
  • noisy_poly
  • noisy_spiral
  • load_iris #18

NOT TO PORT

  • predict, predict! (this interface exists already in StatsAPI)
  • fit (this interface exists already in StatsAPI)
  • noisy_function
  • expand_poly (rethink this based on scikitlearn's PolynomialFeatures)
  • center!, rescale!, FeatureNormalizer. For the time being normalise seems enough.
  • load_line, load_sin, load_poly, load_spiral. These can be generated on the fly with make_*.
