MLUtils.jl's Introduction

MLUtils.jl

MLUtils.jl defines interfaces and implements common utilities for Machine Learning pipelines.

Features

  • An extensible dataset interface (numobs and getobs); see the example after this list.
  • Data iteration and dataloaders (eachobs and DataLoader).
  • Lazy data views (obsview).
  • Resampling procedures (undersample and oversample).
  • Train/test splits (splitobs).
  • Data partitioning and aggregation tools (batch, unbatch, chunk, group_counts, group_indices).
  • Folds for cross-validation (kfolds, leavepout).
  • Lazy dataset transformations (mapobs, filterobs, groupobs, joinobs, shuffleobs).
  • Toy datasets for demonstration purposes.
  • Other data handling utilities (flatten, normalise, unsqueeze, stack, unstack).
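
A quick illustration of the dataset interface mentioned above (arrays are supported out of the box, with the last dimension indexing observations):

using MLUtils

X = rand(2, 5)       # a container with 5 observations of size 2
numobs(X)            # 5
getobs(X, 3)         # the third observation: a 2-element Vector
getobs(X, [1, 4])    # observations 1 and 4: a 2×2 Matrix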

Examples

Let us take a look at a hello world example to get a feeling for how to use this package in a typical ML scenario.

using MLUtils

# X is a matrix of floats
# Y is a vector of strings
X, Y = load_iris()

# The iris dataset is ordered according to its labels,
# which means that we should shuffle the dataset before
# partitioning it into training and test sets.
Xs, Ys = shuffleobs((X, Y))

# We leave out 15% of the data for testing
cv_data, test_data = splitobs((Xs, Ys); at=0.85)

# Next we partition the data using a 10-fold scheme.
for (train_data, val_data) in kfolds(cv_data; k=10)

    # We apply a lazy transform for data augmentation
    train_data = mapobs(xy -> (xy[1] .+ 0.1 .* randn.(), xy[2]),  train_data)

    for epoch = 1:10
        # Iterate over the data using mini-batches of 5 observations each
        for (x, y) in eachobs(train_data, batchsize=5)
            # ... train supervised model on minibatches here
        end
    end
end

In the above code snippet, the inner eachobs loop is the only place where data other than indices is actually copied. In fact, while x and y are materialized arrays, everything else is a data view.
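
As a quick check (a sketch; the exact view types are an implementation detail), we can verify that shuffling returns lazy views rather than copies, and that observations are materialized only by getobs:

using MLUtils

X, Y = load_iris()
Xs, Ys = shuffleobs((X, Y))

Xs isa SubArray     # true: Xs is a view into X, nothing was copied
getobs(Xs, 1)       # materializes the first observation as a plain Vector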

Related Packages

MLUtils.jl brings together functionality previously found in LearnBase.jl, MLDataPattern.jl, and MLLabelUtils.jl. These packages are now discontinued.

Other features were ported from the deep learning library Flux.jl, as they are of general use.

If you are looking for a scikit-learn replacement, MLJ.jl is a more complete package for managing the whole machine learning pipeline.

MLUtils.jl's People

Contributors

ancapdev, carlolucibello, darsnack, gabrevaya, github-actions[bot], karthik-d-k, lorenzoh, mcabbott, neroblackstone, pangoraw, romeov, saransh-cpp, theabhirath, touchesir, vaclavmacha

MLUtils.jl's Issues

broken inferred tests for `stack` and `batch`

julia> using MLUtils, Test

julia> x = [[1,2,3], [4,5,6]]
2-element Vector{Vector{Int64}}:
 [1, 2, 3]
 [4, 5, 6]

julia> @inferred batch(x)
ERROR: return type Matrix{Int64} does not match inferred return type Any
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] top-level scope
   @ REPL[4]:1

julia> @inferred stack(x; dims=1)
ERROR: return type Matrix{Int64} does not match inferred return type Any
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] top-level scope
   @ REPL[5]:1

The problem could be due to a broadcasted call to unsqueeze, on which stack relies:

julia> @inferred unsqueeze(x[1]; dims=1)
1×3 Matrix{Int64}:
 1  2  3

julia> f() = unsqueeze.(x; dims=1)
f (generic function with 1 method)

julia> @inferred f()
ERROR: return type Vector{Matrix{Int64}} does not match inferred return type Any
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] top-level scope
   @ REPL[13]:1

Implementing `rand_like`

I was working on an implementation of rand_like, similar to the one in PyTorch, and I was unsure about how to handle RNGs. Right now, MLUtils does not depend on CUDA, so handling RNGs for CuArrays will be a problem because CUDA.default_rng() will not be accessible. Is there a way to write this so that the function still works for both CPU and GPU without introducing a CUDA.jl dependency?
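
One possible approach (a sketch, not an MLUtils implementation): build the output with similar and fill it with Random.rand!, so the RNG choice is delegated to whatever package owns the array type. CUDA.jl overloads rand! for CuArray, so no explicit CUDA dependency would be needed here:

using Random

# Hypothetical sketch of rand_like: allocate an array like `x` and fill it
# with uniform random numbers. For a CuArray, CUDA.jl's own rand! method
# (and hence its RNG) is dispatched to automatically.
rand_like(x::AbstractArray, ::Type{T} = eltype(x)) where {T} = rand!(similar(x, T))
rand_like(rng::AbstractRNG, x::AbstractArray, ::Type{T} = eltype(x)) where {T} =
    rand!(rng, similar(x, T))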

Scope of this package

This package is kickstarting the plan outlined in JuliaML/LearnBase.jl#49

  • For the moment we can add both definitions and implementations here; at some point we will move the basic definitions to a LearnAPI.jl package (to be created).
  • We can gradually move the functionality from MLLabelUtils and MLDataPattern here. We mainly want to add the functionality that is actually in use in the ecosystem (e.g. FastAI.jl and MLJ.jl) and leave the rest out to avoid extra maintenance complexity.

@darsnack @johnnychen94

define `numobs` and `getobs` fallbacks for Tables.jl's tables?

We could implement generic fallbacks like:

function numobs(x) 
  if Tables.istable(x)
    length(Tables.rows(x))
  else
    error("numobs not defined")
  end
end

function getobs(x, i)
  if Tables.istable(x)
    Tables.rows(x)[i]
  else
    error("getobs not defined")
  end
end
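
With such fallbacks any row table would behave as a data container. A quick sketch of what the fallbacks would return, assuming standard Tables.jl semantics for a vector of NamedTuples:

using Tables

tbl = [(a = 1, b = 2.0), (a = 3, b = 4.0)]   # a vector of NamedTuples is a row table

Tables.istable(tbl)        # true
length(Tables.rows(tbl))   # 2, what the numobs fallback would return
Tables.rows(tbl)[2]        # (a = 3, b = 4.0), what the getobs fallback would return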

Lazy operations should be opt-in

Operations like splitobs, shuffleobs, and many more return ObsViews that one has to call getobs on in order to materialize.
I think this is unexpected for users coming from scikit-learn and mildly annoying in most scenarios.
By default, operations on materialized objects should return materialized objects (e.g. arrays and dataframes).
Users would be able to opt in to lazy behavior by wrapping data in an ObsView. Operations on an ObsView would produce other ObsViews that can be materialized only at the end of the pipeline.

PR welcome on stratified K-folds?

I can make a first-round PR and am then willing to make whatever changes are necessary, if anyone is willing to coach me through it and look at my PR. Thanks.

Padded batching

When working with ANNs that output different-sized vectors depending on the input (for example when using GraphNeuralNetworks.jl), it would be useful to be able to convert the output of batch to a CuArray in order to perform loss computations.
Current:

julia> MLUtils.batch([[1,2],[3,4]])
2×2 Matrix{Int64}:
 1  3
 2  4

julia> MLUtils.batch([[1,2],[3]])
ERROR: DimensionMismatch("mismatch in dimension 1 (expected 2 got 1)")

Feature:

julia> MLUtils.batch([[1,2],[3]], pad =  0)
2×2 Matrix{Int64}:
 1  3
 2  0
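
Until something like this exists, a user-side workaround could look like the following sketch (batch_with_padding is a hypothetical helper, not part of MLUtils):

using MLUtils

# Pad every vector to the length of the longest one, then batch as usual.
function batch_with_padding(xs::AbstractVector{<:AbstractVector}; value = 0)
    n = maximum(length, xs)
    padded = [vcat(x, fill(value, n - length(x))) for x in xs]
    return batch(padded)
end

batch_with_padding([[1, 2], [3]])   # 2×2 Matrix: [1 3; 2 0]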

circularity in AbstractDataContainer definitions

We currently have the following definitions for AbstractDataContainer where getindex falls back to getobs

abstract type AbstractDataContainer end

Base.getindex(x::AbstractDataContainer, i) = getobs(x, i)
Base.length(x::AbstractDataContainer) = numobs(x)
Base.size(x::AbstractDataContainer) = (length(x),)

Base.iterate(x::AbstractDataContainer, state = 1) =
    (state > length(x)) ? nothing : (x[state], state + 1)
Base.lastindex(x::AbstractDataContainer) = length(x)

and on the other end we have the generic fallback for getobs

getobs(x, i) = getindex(x, i)
numobs(x) = length(x)

I find this circularity a bit confusing and think it should be avoided. I suggest we change AbstractDataContainer to

abstract type AbstractDataContainer end

Base.iterate(x::AbstractDataContainer, state = 1) =
    (state > numobs(x)) ? nothing : (getobs(x, state), state + 1)
Base.lastindex(x::AbstractDataContainer) = numobs(x)

Then types inheriting from AbstractDataContainer:

  • Can implement getindex if they want both the "indexing" interface and the "observation" interface.
  • Can implement just getobs if for some reason they don't want to expose an indexing interface.
  • Can implement both getobs and getindex if the two interfaces serve different purposes (e.g. as with arrays).

As an addendum, let me remark that with the getobs(x, i) = getindex(x, i) fallback we are basically saying that we consider any type implementing getindex to be a dataset, which is something we should perhaps document better. Should defining getindex be the recommended way to define custom dataset types (even when not subtyping AbstractDataContainer)?
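
For illustration, a minimal custom container under this interface could look like the following sketch (Squares is a made-up example type):

using MLUtils

# A dataset whose i-th observation is i^2.
struct Squares <: MLUtils.AbstractDataContainer
    n::Int
end

MLUtils.numobs(d::Squares) = d.n
MLUtils.getobs(d::Squares, i) = i .^ 2

d = Squares(5)
numobs(d)     # 5
getobs(d, 3)  # 9
d[3]          # 9 under the current definitions, where getindex falls back to getobs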

@darsnack
related to JuliaML/MLDatasets.jl#96

Drop ObsView etc.

Should we drop DataView and ObsView? Most of the data iterators can be implemented without them, and I think we should minimize the number of extraneous (abstract) types.

Unnecessary allocation in getobs! for arrays

Hi, I think that the current implementation of getobs! for arrays is not optimal, since it allocates a new array.

function getobs!(buffer::AbstractArray, A::AbstractArray)
    buffer .= A
    return buffer
end

function getobs!(buffer::AbstractArray, A::AbstractArray{<:Any, N}, idx) where N
    I = ntuple(_ -> :, N-1)
    buffer .= A[I..., idx]
    return buffer
end

This can easily be fixed by using a view instead of normal indexing, but I think it's better to use the following implementation, which uses copyto!. However, I'm not entirely sure whether it's safe to use copyto! this way.

function getobs_new!(buffer::AbstractArray, A::AbstractArray)
    Base.setindex_shape_check(buffer, size(A)...)
    copyto!(buffer, A)
    return buffer
end

function getobs_new!(buffer::AbstractArray, A::AbstractArray{<:Any, N}, idx) where N
    I = ntuple(_ -> :, N-1)
    src = view(A, I..., idx)
    Base.setindex_shape_check(buffer, size(src)...)
    copyto!(buffer, src)
    return buffer
end

The main advantage is that CUDA.jl defines copyto! for CuArray, and therefore getobs_new! should work if the buffer is a CuArray and A is a standard array. However, copyto! for CuArray is currently only implemented for Array and not for SubArray, so getobs_new! with idx does not work properly. Below is a simple comparison of both implementations:

using MLUtils

shape = (256, 256, 3);
buffer1 = rand(shape..., 64);
buffer2 = rand(shape..., 64);
x = rand(shape..., 100);
idx = rand(1:100, 64);
julia> buffer1 == buffer2
false

julia> getobs!(buffer1, x, idx);

julia> getobs_new!(buffer2, x, idx);

julia> buffer1 == buffer2
true

Simple Benchmarks:

julia> using BenchmarkTools

julia> @benchmark getobs!($buffer1, $x, $idx)
BenchmarkTools.Trial: 85 samples with 1 evaluation.
 Range (min … max):  52.979 ms … 71.178 ms  ┊ GC (min … max): 0.00% … 19.13%
 Time  (median):     55.086 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.376 ms ±  6.919 ms  ┊ GC (mean ± σ):  7.73% ±  9.42%

    █  ▇                                              ▃        
  ▆█████▇▇▆▅▁▅▃▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▅▆▅▇██▁▃▁▃▃ ▁
  53 ms           Histogram: frequency by time        71.2 ms <

 Memory estimate: 96.00 MiB, allocs estimate: 3.

julia> @benchmark getobs_new!($buffer1, $x, $idx)
BenchmarkTools.Trial: 128 samples with 1 evaluation.
 Range (min … max):  37.904 ms … 43.696 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     38.963 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   39.047 ms ± 661.120 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▂     ██ ▂▄ ▆   ▂   ▆▆ ▂ ▂                        
  ▄▁▄▄▁▆▄▆█▄▆▄██▆▁▆██████▆██▆▆███▆████▄█▄█▆█▆▁▄▄▄▁▄▁▆▁▄▁▁▄▁▁▄▄ ▄
  37.9 ms         Histogram: frequency by time         40.4 ms <

 Memory estimate: 48 bytes, allocs estimate: 1.

Usage of CUDA

julia> using CUDA

julia> getobs!(CuArray(buffer1), x[:,:,:,idx]);
ERROR: GPU compilation of kernel #broadcast_kernel#17(CUDA.CuKernelContext, CuDeviceArray{Float64, 4, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, NTuple{4, Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{Array{Float64, 4}, NTuple{4, Bool}, NTuple{4, Int64}}}}, Int64) failed
KernelError: passing and using non-bitstype argument

Argument 4 to your kernel function is of type Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, NTuple{4, Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{Array{Float64, 4}, NTuple{4, Bool}, NTuple{4, Int64}}}}, which is not isbits:
  .args is of type Tuple{Base.Broadcast.Extruded{Array{Float64, 4}, NTuple{4, Bool}, NTuple{4, Int64}}} which is not isbits.
    .1 is of type Base.Broadcast.Extruded{Array{Float64, 4}, NTuple{4, Bool}, NTuple{4, Int64}} which is not isbits.
      .x is of type Array{Float64, 4} which is not isbits.

julia> getobs_new!(CuArray(buffer2), x, 1:64);
┌ Warning: Performing scalar indexing on task Task (runnable) @0x00007fa662080010.
│ Invocation of setindex! resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArrays ~/.julia/packages/GPUArrays/Zecv7/src/host/indexing.jl:56
      
julia> buffer_cu = getobs_new!(CuArray(buffer2), x[:,:,:,1:64]);

julia> Array(buffer_cu) == x[:,:,:,1:64]
true

mapobs problems with composite datasets and vector indexes

With tuples:

julia> t = (zeros(3), ones(3))
([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])

julia> m = mapobs(x -> (x[1], 2*x[2]), t)
mapobs(#15, Tuple{Vector{Float64}, Vector{Float64}})

julia> m[1]
(0.0, 2.0)

julia> m[2]
(0.0, 2.0)

julia> m[1:2]
((0.0, 0.0), (1.0, 2.0))  # expected ([0.0, 0.0], [2.0, 2.0]) 

With named tuples:

julia> t = (a=zeros(3),b=ones(3))
(a = [0.0, 0.0, 0.0], b = [1.0, 1.0, 1.0])

julia> m = mapobs(x -> (x[1], 2*x[2]), t)
mapobs(#17, NamedTuple{(:a, :b), Tuple{Vector{Float64}, Vector{Float64}}})

julia> m[1]
(0.0, 2.0)

julia> m[2]
(0.0, 2.0)

julia> m[1:2]
ERROR: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved
Stacktrace:
 [1] broadcastable(#unused#::NamedTuple{(:a, :b), Tuple{Vector{Float64}, Vector{Float64}}})
   @ Base.Broadcast ./broadcast.jl:705
 [2] broadcasted
   @ ./broadcast.jl:1295 [inlined]
 [3] getindex(data::MLUtils.MappedData{var"#17#18", NamedTuple{(:a, :b), Tuple{Vector{Float64}, Vector{Float64}}}}, idxs::UnitRange{Int64})
   @ MLUtils ~/.julia/packages/MLUtils/8OXl7/src/obstransform.jl:14
 [4] top-level scope
   @ REPL[35]:1

The fix could be turning

Base.getindex(data::MappedData, idxs::AbstractVector) = data.f.(getobs(data.data, idxs))

into

Base.getindex(data::MappedData, idxs::AbstractVector) = Flux.batch([data.f(getobs(data.data, i)) for i in idxs]) 

Port parallel loaders from DataLoaders.jl

This is the second part of porting functionality from DataLoaders.jl.

This one includes the parallel loaders:

  • GetObsParallel
  • BufferGetObsParallel

With this also comes the question of what the interface for buffered container views will look like. In general, the pattern is that one wants an iterator over the observations in a data container, with the 4 combinations arising from buffered/unbuffered and parallel/single-threaded.
Can we brainstorm a consistent interface here? The first thing that comes to mind is having a single eachobs function with keyword arguments, eachobs(data; buffered = false, parallel = false). This could then also give warnings when you pass parallel = true while Threads.nthreads() == 1, and so on.
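
For illustration, the keyword-based interface proposed above might be used like this (hypothetical, not an existing API):

X = rand(2, 100)   # any data container

# All four combinations through a single entry point (hypothetical keywords):
eachobs(X)                                    # single-threaded, no buffering
eachobs(X; buffered = true)                   # reuse a buffer via getobs!
eachobs(X; parallel = true)                   # multi-threaded loading
eachobs(X; buffered = true, parallel = true)  # both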

@darsnack @CarloLucibello

DataLoader: partial = false returns incorrect batch sizes (introduced in v0.2.7)

With MLUtils v0.2.7:

partial = false: incorrect (should be 2 batches with nobs=5)

julia> using Flux
       x = rand(4,12)
       dloader = Flux.Data.DataLoader(x, batchsize=5, shuffle=true, partial=false)
       for d in dloader
           println(size(d))
       end
(4, 5)
(4, 3)

partial = true: ✓

julia> using Flux
       x = rand(4,12)
       dloader = Flux.Data.DataLoader(x, batchsize=5, shuffle=true, partial=true)
       for d in dloader
           println(size(d))
       end
(4, 5)
(4, 5)
(4, 2)

With MLUtils v0.2.6:

partial = false: ✓

julia> using Flux
       x = rand(4,12)
       dloader = Flux.Data.DataLoader(x, batchsize=5, shuffle=true, partial=false)
       for d in dloader
           println(size(d))
       end
(4, 5)
(4, 5)

partial = true: ✓

julia> using Flux
       x = rand(4,12)
       dloader = Flux.Data.DataLoader(x, batchsize=5, shuffle=true, partial=true)
       for d in dloader
           println(size(d))
       end
(4, 5)
(4, 5)
(4, 2)

`chunk` returns views for CuArrays

See the discussion in FluxML/Metalhead.jl#165
and JuliaGPU/CUDA.jl#1542.

The problem is that we don't want wrapped CuArray types, since they will affect subsequent dispatches, right @theabhirath @darsnack?

using CUDA, MLUtils

julia> x = rand(2, 10) |> cu;

julia> a = chunk(x, 2)[1]  # wrapped cuarray
2×2 view(::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, :, 1:2) with eltype Float32:
 0.545259  0.936495
 0.174116  0.514381

julia> view(x, :, 1:2) # just a cuarray
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.545259  0.936495
 0.174116  0.514381

Shuffling a BatchView gives an ObsView of the data

The following gives a shuffled ObsView of the underlying data, instead of shuffling the underlying data while maintaining the batch view:

using MLUtils
x = rand(100)
x_train, x_val = splitobs(x; at=0.7)
dl = BatchView(x_train; batchsize=10)
dl = shuffleobs(dl)

Gives:

julia> shuffleobs(dl)
70-element view(::Vector{Float64}, [31, 32, 33, 34, 35, 36, 37, 38, 39, 40  …  61, 62, 63, 64, 65, 66, 67, 68, 69, 70]) with eltype Float64:
 0.863309839047254
 0.09280526616938278
 0.9567810179363109
 0.5684046270934635
....

Instead of something more akin to BatchView(shuffleobs(dl.data); batchsize=dl.batchsize, partial=dl.partial)

API for iterator/view variants

Re #33, what API do we want to go with for the parallel eachobs? I think having eachobs(...; parallel=false) would be cleanest.

Same question regarding the collated BatchView in #63: should we add a collate keyword to BatchView?

Extend function `chunk()` to accept a chunk size parameter?

Hi, I have a use case where I want to split an array x into parts in the same way as the function chunk() does, but with a known chunk size and an unknown number of parts. AFAICS there is no function similar to chunk() in MLUtils.jl that expects a chunk size parameter instead of the number of parts.

Of course, I could write a wrapper around chunk() that calculates the number of parts from the array size and the desired chunk size. But as can be seen in utils.jl, internally the number of parts is already converted to a chunk size. That means a wrapper around chunk() would convert the chunk size to a number of parts, and chunk() would then convert the number of parts back to a chunk size.

My question therefore is whether it would be meaningful to provide a function similar to chunk() in MLUtils.jl that expects a chunk size parameter. The current implementation of chunk() could then be rewritten as a wrapper around this new function, calculating the chunk size from the number of parts.
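
For reference, the wrapper mentioned above could be a one-liner (chunk_by_size is a hypothetical helper, not part of MLUtils):

using MLUtils

# Split a vector into chunks of at most `sz` elements by converting
# the chunk size into a number of parts for `chunk`.
chunk_by_size(x::AbstractVector, sz::Int) = chunk(x, ceil(Int, length(x) / sz))

chunk_by_size(collect(1:10), 4)   # 3 chunks: [1, 2, 3, 4], [5, 6, 7, 8], [9, 10]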

Can MLUtils play nicely with Tables.jl?

I think one could greatly increase buy-in for MLUtils.jl if every Tables.jl-compatible table automatically implemented the "data container" API. For performance one would still want to implement support for the concrete table types as well, but having it "just work" for all tables would be nice. I guess, since "table" is itself just an interface rather than an abstract type, this would need to be implemented as part of the data container API, right? As Tables.jl is very lightweight, I don't see that as a big issue (and I could probably find someone to help with the integration).

Even so, there seems to be a problem implementing the interface for certain tables. MLUtils.jl interprets tuples in a very specific way. For example, shuffleobs((x1, x2)) treats x1 and x2 as separate data containers, which are to be shuffled simultaneously with the same base observation index shuffle. But some tables are tuples. The following example is even a tuple-table whose elements are themselves tables (of a different type):

julia> X
((a = [1, 3], b = [2, 3]), (a = [2, 5], b = [4, 7]))

julia> Tables.istable(X)
true

So is such a tuple a pair of data containers or a single data container? The current API cannot distinguish them.

I wonder:

  1. How attached are people to the current tuple-based dispatch for coupled multi-container processing?
  2. Is there a big use-case for tables that are also tuples? @quinnj

Possibly this discussion is related.

Tables that are tuples are problematic elsewhere.

@oxinabox @rikhuijzer @darsnack

conflict on function names with Flux.jl

WARNING: Method definition rpad(AbstractArray{T, 1} where T, Integer, Any) in module MLUtils at /home/admin/.julia/packages/MLUtils/OojOS/src/utils.jl:384 overwritten in module Flux at /home/admin/.julia/packages/Flux/7nTyc/src/utils.jl:610.
  ** incremental compilation may be fatally broken for this module **

add gradient tests

We should add gradient tests to functions that could be part of a trained model:

  • chunk #47
  • unsqueeze
  • stack, unstack

Port some data container functionality out of FastAI.jl

FastAI.jl currently has some data container functionality that I've found very useful. On the last ML ecosystem call, @darsnack, @ToucheSir, and I discussed that it makes sense to have some of that in MLUtils.jl. The relevant FastAI.jl code can be found here: transformations.jl

Specifically, there are some data container transformations that I believe should be ported:

  • mapobs(f, data), a lazy map over any data container. Generally useful.
  • groupobs(f, data) returns a Dict whose keys are the return values of f(obs) and whose values are data subsets of the observations that returned the same f(obs). Not sure if Dict is the right type here, but NamedTuple is too restrictive. Useful, for example, to create train/test splits based on some value in each observation.
  • filterobs(f, data) does what you'd expect, returning a data subset.
  • joinobs(datas...) treats multiple data containers as a single one. Open to a better name for this.

There are also some data container primitives for working with tables and files, but let's put that into another issue.
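
For a rough idea of the intended usage of the transformations listed above (a sketch based on the descriptions; exact return types may differ):

using MLUtils

data = 1:10

m = mapobs(x -> 2x, data)     # lazy map: getobs(m, 3) == 6
g = groupobs(iseven, data)    # Dict mapping false/true to the odd/even observations
f = filterobs(iseven, data)   # only the observations for which the predicate holds
j = joinobs(1:3, 4:6)         # the two containers behave as a single one with 6 observations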

`eachobs(;batchsize)` vs `BatchView(;batchsize)` vs `DataLoader(;batchsize)`

As the title suggests, I am wondering why there are three ways to iterate over batches:

using MLUtils
X = rand(4, 100)

it1 = eachobs(X, batchsize=10)
it2 = BatchView(X, batchsize=10)
it3 = DataLoader(X, batchsize=10)

for (x1, x2, x3) in zip(it1, it2, it3)
    @assert size(x1) == size(x2) == size(x3)
    @assert x1 == x2 == x3
end

Looking at the implementation, eachobs is implemented using BatchView, and DataLoader uses eachobs. So pardon my ignorant question, but why not have just one way of iterating over batches that provides all the features (shuffling, (partial) batching, etc.)?

Status of MLDataPattern porting

A list of what is currently exported from MLDataPattern.jl.

TO PORT

  • getobs, getobs! and nobs. #1
    • nobs is now numobs;
    • obsdim argument is dropped from the interface
  • randobs #1
  • datasubset, DataSubset #4
  • shuffleobs #5
  • splitobs #5
  • DataView #5
    • Consider removal
  • obsview, ObsView #5
    • Consider removal #8
  • batchview, BatchView #6
  • batchsize #6
  • slidingwindow, SlidingWindow
  • stratifiedobs
  • oversample, undersample #10
  • kfolds #9
  • leaveout #9
  • eachobs #9
  • eachbatch #9

NOT TO BE PORTED

  • BufferGetObs
  • RandomObs, RandomBatches
  • BalancedObs
  • FoldView
  • targets
  • eachtarget

Port collated batch view from DataLoaders.jl

This is the first part of porting functionality from DataLoaders.jl (ref #22).

This includes porting

  • BatchViewCollated from batchview.jl
    • Notes: the BatchDim machinery can probably be dropped if we can assume last dim = batch dim
  • collate from https://github.com/lorenzoh/DataLoaders.jl/blob/master/src/collate.jl

batchview.jl also includes an unexported batchsize helper. Do we want to have a proper batchsize?

Deterministic, parallel data iteration

The parallel eachobs implementation is not deterministic in that observations are returned as soon as they are loaded, so they may be returned out of order. This is very performant, and fine for some use cases like training, where data should be shuffled anyway.

Having the option of deterministic iteration would be helpful in many use cases, though.

This could be implemented as a wrapper around an existing iterator that does the following:

  • instead of iterating over data with the wrapped iterator, iterate over (1:nobs(data), data) to preserve ordering information
  • collect returned observations, stripping the index
  • return an observation only if all previous (by index) observations have been returned

I am unsure how much this will affect performance and memory usage, and how it interacts with buffersize. Are there alternative approaches to this implementation?
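
A minimal sketch of the reordering idea (not MLUtils code): buffer (index, observation) pairs that arrive out of order and emit an observation only once all earlier indices have been seen.

function reorder(indexed_obs)
    buffer = Dict{Int,Any}()
    next = 1
    out = Any[]
    for (i, obs) in indexed_obs
        buffer[i] = obs
        # flush everything that is now contiguous with the last emitted index
        while haskey(buffer, next)
            push!(out, pop!(buffer, next))
            next += 1
        end
    end
    return out
end

reorder([(2, "b"), (1, "a"), (3, "c")])   # ["a", "b", "c"]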

Tables.jl and DataAPI.jl interoperation

@ablaom I am not sure if this is the best place to start this discussion, but it is a follow up to https://discourse.julialang.org/t/random-access-to-rows-of-a-table/77386 and JuliaData/Tables.jl#278.

The key point is to avoid creating functions with essentially the same functionality across DataAPI.jl, Tables.jl, and MLUtils.jl (and possibly other ML packages I am not aware of).

Assume for a moment that a Tables.jl table is the source of data for some ML model, and that you want operations to be efficient.

My understanding is that your high-level workflow is the following:

  1. the user starts with a Tables.jl table.
  2. then the user performs observation subsetting, feature selection, and feature transformation operations on this table (either eagerly or lazily).
  3. finally the user transforms the result of step 2 (again, either lazily or eagerly) into a value that can be accepted as an input by the ML algorithm.

The question is:

What functionalities do you need to have in DataAPI.jl and Tables.jl so that this is efficient and you do not need to provide duplicate definitions of concepts in MLUtils.jl (or other packages)?
Another consideration (raised in the linked discussions) is that I would expect what we develop to be consistent with the interfaces that Base Julia already defines (e.g. the iterator interface, abstract vector interface, indexing interface, and view interface).

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Status of MLLabelUtils porting

TO PORT

  • labelmap. Now group_indices. #10
  • labelfreq. Now group_counts
  • labelmap2vec
  • ind2label, label2ind,
  • labeltype,
  • label,
  • nlabel,
  • poslabel, neglabel,
  • isposlabel, isneglabel,
  • classify, classify!,
  • convertlabel,
  • LabelEnc, labelenc, islabelenc,
  • convertlabelview

NOT TO PORT

DataLoader.nobs could make use of the `partial` flag to return the final number of samples being used?

I have recently started learning Flux for deep learning and came across this behavior of the DataLoader object.
If we create a DataLoader object as follows:

julia> dl = DataLoader(rand(Int8, 10, 64), batchsize=30, partial=true)
DataLoader{Matrix{Int8}, Random._GLOBAL_RNG}(Int8[-73 -49 … 65 57; 82 -99 … -125 -72; … ; -109 23 … 14 -68; -60 -90 … -121 70], 30, 64, true, false, Random._GLOBAL_RNG())

julia> dl.nobs
64

We get the total number of samples that will be used by the DataLoader as 64, which is correct.

But when we set partial=false, we get the same behavior as explained above, with dl.nobs set to the same value of 64.

My expectation in the latter scenario would be for dl.nobs to be set to 60, because we will be throwing away the last 4 samples (dropping the last mini-batch).

As I couldn't find the docs for dl.nobs, this is my current understanding; please correct me if I'm missing something obvious here.

And, if my understanding is correct, there are possibly 2 changes needed in the main/src/dataloader.jl file, as follows:

function DataLoader(data; batchsize=1, shuffle=false, partial=true, rng=GLOBAL_RNG)
    batchsize > 0 || throw(ArgumentError("Need positive batchsize"))
    nobs = numobs(data)
+   partial || (nobs -= nobs % batchsize)     # subtract last mini-batch samples when `partial=false`
    if nobs < batchsize
        @warn "Number of observations less than batchsize, decreasing the batchsize to $nobs"
        batchsize = nobs
    end
    DataLoader(data, batchsize, nobs, partial, shuffle, rng)
end
function Base.length(d::DataLoader)
    n = d.nobs / d.batchsize
-   d.partial ? ceil(Int, n) : floor(Int, n)  # removing this line as we would get correct `n`
end

I'm new to Julia, looking forward to learning and improving 😃🤞

Archiving DataLoaders.jl

Now that MLUtils.jl has feature parity with DataLoaders.DataLoader, it's time to deprecate DataLoaders.jl.

We should:

  • add a guide for users transitioning that explains the differences between DataLoaders.DataLoader and MLUtils.DataLoader.
  • port and adapt relevant documentation, including:
  • go through DataLoaders.jl issues to search for feature requests that we can move to MLUtils.jl, i.e. lorenzoh/DataLoaders.jl#32 -> #68

CI test in multithreaded environment

Since we have a multithreaded dataloader (named eachobsparallel for the time being), we should also test in a multithreaded environment in GitHub Actions CI.

related to #80

Status of MLDataUtils porting

TO PORT

  • noisy_sin -> now Datasets.make_sin #19
  • noisy_poly
  • noisy_spiral
  • load_iris #18

NOT TO PORT

  • predict, predict! (this interface exists already in StatsAPI)
  • fit (this interface exists already in StatsAPI)
  • noisy_function
  • expand_poly (rethink this based on scikitlearn's PolynomialFeatures)
  • center!, rescale!, FeatureNormalizer. For the time being normalise seems enough.
  • load_line, load_sin, load_poly, load_spiral. These can be generated on the fly with make_*.
