diskarrays.jl's Introduction

DiskArrays.jl

This package provides a collection of utilities for working with n-dimensional array-like data structures that have considerable overhead for single read operations. The most important examples are arrays that represent data on hard disk and are accessed through a C library, or arrays that are compressed in chunks. It can be inadvisable to make these arrays a direct subtype of AbstractArray, because many functions working with AbstractArrays assume fast random access to single values (including basic things like getindex, show, reduce, etc.).

Currently supported features are:

  • getindex/setindex! with the same rules as Base (trailing or singleton dimensions, etc.)
  • views into DiskArrays
  • a fallback Base.show method that does not call getindex repeatedly
  • implementations of mapreduce and mapreducedim that respect the chunking of the underlying dataset. This greatly increases the performance of higher-level reductions like sum(a, dims=d)
  • an iterator over the values of a DiskArray that caches a chunk of data and returns the values within. This allows efficient usage of e.g. using DataStructures; counter(a)
  • customization of broadcast when there is a DiskArray on the LHS. This at least makes things like a .= 5 possible and relatively fast

AbstractDiskArray Interface definition

Package authors who want to use this library to make their disk-based array an AbstractDiskArray should at least implement methods for the following functions:

Base.size(A::CustomDiskArray)
readblock!(A::CustomDiskArray{T,N},aout,r::Vararg{AbstractUnitRange,N})
writeblock!(A::CustomDiskArray{T,N},ain,r::Vararg{AbstractUnitRange,N})

Here readblock! will read a subset of array A in a hyper-rectangle defined by the unit ranges r. The results shall be written into aout. writeblock! should write the data given by ain into the (hyper-)rectangle of A defined by r. When defining these functions it can be safely assumed that length(r) == ndims(A) as well as size(ain) == length.(r). However, bounds checking is not performed by the DiskArray machinery and currently has to be done by the implementation.

If the data on disk has rectangular chunks as underlying storage units, you should additionally implement the following methods to optimize some operations like broadcast, reductions and sparse indexing:

DiskArrays.haschunks(A::CustomDiskArray) = DiskArrays.Chunked()
DiskArrays.eachchunk(A::CustomDiskArray) = DiskArrays.GridChunks(A, chunksize)

where chunksize is an integer tuple of chunk lengths. If the array does not have an internal chunking structure, one should define

DiskArrays.haschunks(A::CustomDiskArray) = DiskArrays.Unchunked()

Implementing only these methods makes all kinds of strange indexing patterns work (Colons, StepRanges, integer vectors, Boolean masks, CartesianIndices, arrays of CartesianIndex, and mixtures of all of these), while making sure that as few readblock! or writeblock! calls as possible are performed: a rectangular bounding box of the required array values is read and the result is re-arranged into the output array.

In addition, DiskArrays.jl provides a few optimizations for sparse indexing patterns to avoid reading and discarding too much unnecessary data from disk, for example for indices like A[:,:,[1,1500]].

Example

Here we define a new array type that wraps a normal AbstractArray. The only access method that we define is a readblock! function where indices are strictly given as unit ranges along every dimension of the array. This is a very common API used in libraries like HDF5, NetCDF and Zarr. We also define a chunking, which will control the way iteration and reductions are computed. In order to understand how exactly data is accessed, we add print statements to the readblock! and writeblock! functions.

using DiskArrays

struct PseudoDiskArray{T,N,A<:AbstractArray{T,N}} <: AbstractDiskArray{T,N}
  parent::A
  chunksize::NTuple{N,Int}
end
PseudoDiskArray(a;chunksize=size(a)) = PseudoDiskArray(a,chunksize)
DiskArrays.haschunks(a::PseudoDiskArray) = DiskArrays.Chunked()
DiskArrays.eachchunk(a::PseudoDiskArray) = DiskArrays.GridChunks(a, a.chunksize)
Base.size(a::PseudoDiskArray) = size(a.parent)
function DiskArrays.readblock!(a::PseudoDiskArray,aout,i::AbstractUnitRange...)
  ndims(a) == length(i) || error("Number of indices is not correct")
  all(r->isa(r,AbstractUnitRange),i) || error("Not all indices are unit ranges")
  println("Reading at index ", join(string.(i)," "))
  aout .= a.parent[i...]
end
function DiskArrays.writeblock!(a::PseudoDiskArray,v,i::AbstractUnitRange...)
  ndims(a) == length(i) || error("Number of indices is not correct")
  all(r->isa(r,AbstractUnitRange),i) || error("Not all indices are unit ranges")
  println("Writing to indices ", join(string.(i)," "))
  view(a.parent,i...) .= v
end
a = PseudoDiskArray(rand(10,9,1), chunksize=(5,3,1))
Disk Array with size 10 x 9 x 1

Now all the Base indexing behaviors work for our array, while minimizing the number of reads that have to be done:

a[:,3]
Reading at index Base.OneTo(10) 3:3 1:1

10-element Array{Float64,1}:
 0.8821177068878834
 0.6220977650963209
 0.22676949571723437
 0.3177934541451004
 0.08014908894614026
 0.9989838001681182
 0.5865160181790519
 0.27931778627456216
 0.449108677620097  
 0.22886146620923808

As can be seen from the read message, only a single call to readblock! is performed, which will map to a single call into the underlying C library.

mask = falses(10,9,1)
mask[3,2:4,1] .= true
a[mask]
3-element Array{Int64,1}:
 6
 7
 8

One can check in a similar way that reductions respect the chunks defined by the data type:

sum(a,dims=(1,3))
Reading at index 1:5 1:3 1:1
Reading at index 6:10 1:3 1:1
Reading at index 1:5 4:6 1:1
Reading at index 6:10 4:6 1:1
Reading at index 1:5 7:9 1:1
Reading at index 6:10 7:9 1:1

1×9×1 Array{Float64,3}:
[:, :, 1] =
 6.33221  4.91877  3.98709  4.18658  …  6.01844  5.03799  3.91565  6.06882

When a DiskArray is on the LHS of a broadcasting expression, the results will be written chunk by chunk:

va = view(a,5:10,5:8,1)
va .= 2.0
a[:,:,1]
Writing to indices 5:5 5:6 1:1
Writing to indices 6:10 5:6 1:1
Writing to indices 5:5 7:8 1:1
Writing to indices 6:10 7:8 1:1
Reading at index Base.OneTo(10) Base.OneTo(9) 1:1

10×9 Array{Float64,2}:
 0.929979   0.664717  0.617594  0.720272   …  0.564644  0.430036  0.791838
 0.392748   0.508902  0.941583  0.854843      0.682924  0.323496  0.389914
 0.761131   0.937071  0.805167  0.951293      0.630261  0.290144  0.534721
 0.332388   0.914568  0.497409  0.471007      0.470808  0.726594  0.97107
 0.251657   0.24236   0.866905  0.669599      2.0       2.0       0.427387
 0.388476   0.121011  0.738621  0.304039   …  2.0       2.0       0.687802
 0.991391   0.621701  0.210167  0.129159      2.0       2.0       0.733581
 0.371857   0.549601  0.289447  0.509249      2.0       2.0       0.920333
 0.76309    0.648815  0.632453  0.623295      2.0       2.0       0.387723
 0.0882056  0.842403  0.147516  0.0562536     2.0       2.0       0.107673

Accessing strided Arrays

There are situations where one wants to read every other value along a certain axis or use arbitrary strides. Some DiskArray backends may want to provide optimized methods to read such strided data. In this case a backend can define readblock!(a,aout,r::OrdinalRange...) and the respective writeblock! method, which will override the fallback behavior of reading the whole block of data and returning only the desired range.
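For example, reusing the PseudoDiskArray wrapper from the example above, a minimal sketch of such a strided method could look like this (the println only mirrors the style of the example; whether a real backend benefits depends on its API):

function DiskArrays.readblock!(a::PseudoDiskArray, aout, i::OrdinalRange...)
  println("Strided read at index ", join(string.(i), " "))
  aout .= a.parent[i...]
end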

Arrays that do not implement eachchunk

There are arrays that live on disk but are not split into rectangular chunks, so that the haschunks trait returns Unchunked(). In order to still enable broadcasting and reductions for these arrays, a chunk size is estimated such that a certain memory limit per chunk is not exceeded. This memory limit defaults to 100MB and can be modified by changing DiskArrays.default_chunk_size[]. A chunk size is then computed based on the element size of the array. However, there are cases where the size of the element type is undefined, e.g. for Strings or variable-length vectors. In these cases one can overload the DiskArrays.element_size function for certain container types to return an approximate element size (in bytes). Otherwise the size of an element will simply be assumed to equal the value stored in DiskArrays.fallback_element_size, which defaults to 100 bytes.
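As a rough sketch (the StringDiskArray type and the numbers below are made up for illustration and are not part of DiskArrays.jl), a package storing variable-length strings could tune these knobs like this:

using DiskArrays

# hypothetical disk-backed container of variable-length strings
struct StringDiskArray{N} <: AbstractDiskArray{String,N}
  size::NTuple{N,Int}
end
Base.size(a::StringDiskArray) = a.size

# assume roughly 200 bytes per element when estimating chunk sizes
DiskArrays.element_size(::StringDiskArray) = 200

# lower the per-chunk memory limit from the default of 100 (MB, as described above) to 50
DiskArrays.default_chunk_size[] = 50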


diskarrays.jl's Issues

DiskArrays interface for files that are not yet open

I'm implementing a DiskArrays.jl interface in GeoData.jl. But it doesn't keep files open - they are opened when data is required, and closed again afterwards.

The issue is passing opened files to DiskArrays to apply chunked read/write e.g. during broadcast, and closing the file again afterwards. Using closures or some kind of callback function that runs after all chunks are processed should help here - but I'm not sure exactly where that should go.

Chunked broadcast with objects of smaller dimensions is broken.

Using the setup code from the tests (below) this chunked 2*2 broadcast with a vector seems broken:

julia> a_disk = _DiskArray([1 10; 2 20; 3 30; 4 40], chunksize=(2, 2))
Disk Array with size 4 x 2

julia> a_disk .* [1, 2, 3, 4] |> collect
4×2 Matrix{Int64}:
  1    9
  4   16
 10   90
 40  160

julia> [1 10; 2 20; 3 30; 4 40] .* [1, 2, 3, 4]
4×2 Matrix{Int64}:
  1   10
  4   40
  9   90
 16  160

We get the chunks as columns. The same problem occurs for a Tuple, I found this trying to add Tuple support in #39.

Setup code:

using DiskArrays
using DiskArrays: ReshapedDiskArray, PermutedDiskArray
struct _DiskArray{T,N,A<:AbstractArray{T,N}} <: AbstractDiskArray{T,N}
  getindex_count::Ref{Int}
  setindex_count::Ref{Int}
  parent::A
  chunksize::NTuple{N,Int}
end
_DiskArray(a;chunksize=size(a)) = _DiskArray(Ref(0),Ref(0),a,chunksize)
Base.size(a::_DiskArray) = size(a.parent)
DiskArrays.haschunks(::_DiskArray) = DiskArrays.Chunked()
DiskArrays.eachchunk(a::_DiskArray) = DiskArrays.GridChunks(a,a.chunksize)
getindex_count(a::_DiskArray) = a.getindex_count[]
setindex_count(a::_DiskArray) = a.setindex_count[]
trueparent(a::_DiskArray) = a.parent
getindex_count(a::ReshapedDiskArray) = getindex_count(a.parent)
setindex_count(a::ReshapedDiskArray) = setindex_count(a.parent)
trueparent(a::ReshapedDiskArray) = trueparent(a.parent)
getindex_count(a::PermutedDiskArray) = getindex_count(a.a.parent)
setindex_count(a::PermutedDiskArray) = setindex_count(a.a.parent)
trueparent(a::PermutedDiskArray{T,N,<:PermutedDimsArray{T,N,perm,iperm}}) where {T,N,perm,iperm} = permutedims(trueparent(a.a.parent),perm)
function DiskArrays.readblock!(a::_DiskArray,aout,i::AbstractUnitRange...)
  ndims(a) == length(i) || error("Number of indices is not correct")
  all(r->isa(r,AbstractUnitRange),i) || error("Not all indices are unit ranges")
  #println("reading from indices ", join(string.(i)," "))
  a.getindex_count[] += 1
  aout .= a.parent[i...]
end
function DiskArrays.writeblock!(a::_DiskArray,v,i::AbstractUnitRange...)
  ndims(a) == length(i) || error("Number of indices is not correct")
  all(r->isa(r,AbstractUnitRange),i) || error("Not all indices are unit ranges")
  #println("Writing to indices ", join(string.(i)," "))
  a.setindex_count[] += 1
  view(a.parent,i...) .= v
end

Very slow performance indexing into netcdf with a vector

Indexing with a vector can be incredibly slow with the current implementation of batchgetindex.

From this comment by @Alexander-Barth
#131 (comment)

NetCDF.jl + DiskArray

julia> using NetCDF

julia> v = NetCDF.open("http://tds.hycom.org/thredds/dodsC/GLBy0.08/expt_93.0","water_temp");

julia> size(v)
(4500, 4251, 40, 14969)

julia> v1 = @time v[1000,1000,1,1:10];
  1.478774 seconds (3.71 M allocations: 250.669 MiB, 3.42% gc time, 59.32% compilation time)

julia> v1 = @time v[1000,1000,1,1:10];
  0.467219 seconds (45 allocations: 1.781 KiB)

julia> v1 = @time v[1000,1000,1,collect(1:10)];
739.847348 seconds (1.32 M allocations: 819.944 MiB, 0.03% gc time, 0.10% compilation time)
From this comment by @Alexander-Barth 
https://github.com/meggart/DiskArrays.jl/issues/131#issuecomment-1902759762

NCDatasets 0.12 without DiskArray

julia> using NCDatasets

julia> ds = NCDataset("http://tds.hycom.org/thredds/dodsC/GLBy0.08/expt_93.0"); v = ds["water_temp"];

julia> v1 = @time v[1000,1000,1,1:10];
  0.726571 seconds (134.15 k allocations: 8.849 MiB, 14.90% compilation time)

julia> v1 = @time v[1000,1000,1,1:10];
  0.218793 seconds (62 allocations: 1.984 KiB)

julia> v1 = @time v[1000,1000,1,collect(1:10)];
  0.879592 seconds (982.55 k allocations: 66.829 MiB, 9.49% gc time, 65.79% compilation time)

So 12 minutes versus 1 second for this example. Yes, I know that I can bypass DiskArrays by providing my own batchgetindex. But I don't think that DiskArrays should make the current assumptions if it aims to be generic (if this is the goal).

Can we have the current assumptions (i.e. that reading a chunk is about as fast as reading a subset of a chunk) documented?

Concerning the API, another thing I am wondering is whether we need a function named batchgetindex at all. Why not have getindex simply pass the indices to DiskArrays.readblock! and let DiskArrays.readblock! figure out how to best load the data? DiskArrays.readblock! is already specific to the storage format.

Lazy mean, sum etc?

It seems possible to add a reducing function application wrapper that will take e.g. the mean/sum etc along an axis lazily, similar to how lazy broadcasting works.

dimensions get moved around in unintuitive ways when accessing DiskArray using CartesianIndex

I noticed that the dimensions get moved around in unintuitive ways:

dc = Zarr.zopen(path2zarr)

size(dc["v"])
(834, 834, 84396)

r = [1 4 6 1]
c = [7 8 9 3]

cartind = CartesianIndex.(r,c)

size(dc["v"][cartind, :])
(1, 84396, 4)

cartind = CartesianIndex.(r',c')

size(dc["v"][cartind, :])
(4, 84396, 1)

This is different behavior than for non-DiskArrays:

z = rand(100,100,1000)

cartind = CartesianIndex.(r,c)

size(z[cartind,:])
(1, 4, 1000)

cartind = CartesianIndex.(r',c')
size(z[cartind,:])
(4, 1, 1000)

`BoundsError` when extra-indexing with `AbstractVector`

In regular Julia arrays, it is possible to index an N-dimensional array with more than N indices (see the documentation). When using an AbstractVector for the additional indices, this fails with a DiskArray, which is inconsistent with the behaviour of Julia arrays:

Julia arrays:

julia> A = [8,6,7];
julia> A[2, [1]] # or A[2, 1:1]
1-element Vector{Int64}:
 6

_DiskArray from the tests:

julia> DA = _DiskArray(A);
julia> DA[2, [1]] # or DA[2, 1:1]
ERROR: BoundsError: attempt to access 3-element _DiskArray{Int64, 1, Vector{Int64}} at index [2, [1]]
Stacktrace:
 [1] #7
   @ ~/.julia/dev/DiskArrays/src/diskarray.jl:92 [inlined]
 [2] foreach
   @ ./abstractarray.jl:2712 [inlined]
 [3] interpret_indices_disk(A::_DiskArray{Int64, 1, Vector{Int64}}, r::Tuple{Int64, Vector{Int64}})
   @ DiskArrays ~/.julia/dev/DiskArrays/src/diskarray.jl:91
 [4] getindex_disk(::_DiskArray{Int64, 1, Vector{Int64}}, ::Int64, ::Vararg{Any})
   @ DiskArrays ~/.julia/dev/DiskArrays/src/diskarray.jl:29
 [5] getindex(::_DiskArray{Int64, 1, Vector{Int64}}, ::Int64, ::Vector{Int64})
   @ Main ~/.julia/dev/DiskArrays/src/diskarray.jl:178
 [6] top-level scope
   @ REPL[56]:1

Note that there's no issue when indexing with an integer (DA[2, 1] works fine)

This issue originated from here

cc @rafaqz

DiskArrays.jl vs BlockArrays.jl

Hey,

this is probably old news for you, but I just discovered BlockArrays.jl. If I understand correctly, BlockArrays.jl implements a lot of, if not all, the features we need for DiskArrays.jl.

Does this make DiskArrays.jl obsolete?

Sorry if I'm going through the same thought process you already covered long ago, but maybe this can help.

Array conversion much slower than access via [:,:]

When I want to convert my DiskArray into a normal in-memory Array, the conversion with Array is very slow compared to getindex access.
Is this something that we could fix in DiskArrays, or is there a better method to do the conversion?

julia> @time palfreqdata = palfreq[:,:]
  0.072899 seconds (157 allocations: 166.524 MiB)
5820×7500 Matrix{Float32}:
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32    1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
                                                                         
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32

julia> @time Array(palfreq.data)
  9.528337 seconds (218.25 M allocations: 14.635 GiB, 21.90% gc time)
5820×7500 Matrix{Float32}:
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32    1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
                                                                         
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32

Implement reverse indexing

zarr[end:-1:1, :, :] not working

There is probably some lastindex implementation missing. Should be added as a new indexing method

`cat` should work for iterators of Ints

According to the documentation of cat it should be possible to call it with an iterable of dimensions, so that "the positions along these dimensions are increased simultaneously for each input array, filling with zero elsewhere. This allows one to construct block-diagonal matrices as cat(matrices...; dims=(1,2)), and their higher-dimensional analogues."

We should try to adhere to this interface and I am wondering whether we would need to have a new type ConcatBlockDiskArray to overload the writeblock and readblock functions.

This part of the cat interface is used for example in DimensionalData. Therefore we can't currently use cat of a DiskArray backed AbstractDimArray.

Feature Request: cache chunk data

It would be more efficient to cache the current chunk in use when using a for loop to iterate over the data, because each read has a non-negligible overhead (e.g. HDF5.jl).

optimization for var[:,:,[1,2]], wrong result with indexing with two vectors and the type-stability issue

In NCDatasets 0.12.17 I had an optimization for var[:,:,[1,2]] converting it to var[:,:,1:2], and
var[:,:,[1,2,10,11]] into two load operations v[:,:,1:2] and v[:,:,10:11]. This was implemented here:

https://github.com/Alexander-Barth/NCDatasets.jl/blob/v0.12.17/src/cfvariable.jl#L359-L436

Now a call to v[:,:,[1,2]] loads every element individually (which is quite slow).

In fact, the relevant code is still in NCDatasets 0.13 but no longer called.

Maybe this functionality should go into DiskArrays?

ConstructionBase.jl dep

Would it be possible to add a ConstructionBase.jl dependency here? (its tiny)

The reason is DiskArrays.SubDiskArray and probably other objects here can't be reconstructed from their fields alone - they need type information as well. ConstructionBase.jl defines constructorof to handle this. Then @set from Setfield.jl and reconstruct from Flatten.jl (and BangBang.jl, Accessors.jl etc) can update their fields with new objects.

My use case is opening GeoData.FileArray with write access through nested DiskArrays.SubDiskArray and whatever else is wrapping it.

As you don't know the number of wrappers or what they are it's hard to do this manually. Flatten.modify can just do it automatically and rewrap the whole stack however deep it is. But only if ConstructionBase.constructorof works for all the wrappers.

The alternative is to put it in ConstructionBaseExtras.jl. But using Requires.jl and a wrapper package is a half-baked measure without dependency bounds, so I'd like to ask for a direct dep first.

Chunks do not align in dimension 1

Can we work around this? This being an error is a problem for using chunked broadcast as a default.

I'm not totally across the underlying mechanisms yet, but I can imagine 2 possibilities:

  • force chunk alignment by loading multiple chunks to a temporary chunk of the right size
  • load the whole array to memory and ignore chunking, especially if the array is small

Then the error could be a warning.

Example in README.md

Hi Fabian, I have some issues with the integration of NCDatasets/DiskArrays and I would like to understand a bit better how DiskArrays works independently of NCDatasets. I noticed that in the example in your README.md, RangeArray is not defined.

julia> a = RangeArray(4,5,1)
ERROR: UndefVarError: RangeArray not defined

Maybe it should be something like a = PseudoDiskArray(randn(4,5,1)) if I understand the example.

What is the most efficient way to access discrete columns of a chunked array?

What is the most efficient way to access discrete columns of a zarr array? This follows on from the conversation:
here

Lets say I have a Zarr array that looks like this:

z = rand(100,100,1000)

and let's pretend that z is a chunked Zarr array that lives on S3

Now if I want to get a series of discrete columns from z:

z[3,5,:]
z[5,9,:]
z[10,6,:]

Is there a way I can do this that is efficient and does not read in the same chunk of data more than once?
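Not an authoritative answer, but one possible pattern (assuming the requested columns fall into a reasonably small bounding box; the names cols, block and series are just illustrative) is to read that bounding box in a single call and slice the columns out in memory, so no chunk is read twice:

cols = [(3, 5), (5, 9), (10, 6)]
imin, imax = extrema(first.(cols))
jmin, jmax = extrema(last.(cols))
block = z[imin:imax, jmin:jmax, :]   # one rectangular read covering all requested columns
series = [block[i - imin + 1, j - jmin + 1, :] for (i, j) in cols]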

`view` with an AbstractRange (not AbstractUnitRange) is broken

This is because subsetchunks is specifically only written for reverse

function subsetchunks(r::RegularChunks, subs::AbstractRange)
  # This is a method only to make "reverse" work and should error for all other cases
  if step(subs) == -1 && first(subs) == r.s && last(subs) == 1
    lastlen = length(last(r))
    newoffset = r.cs - lastlen
    return RegularChunks(r.cs, newoffset, r.s)
  end
end

Should we just subset it as if there were no step? Will that work?

Edit: it doesn't.

Maybe the batch PR can work for this as well? views like this are very useful for reducing the size of (very) large rasters for plotting. Rasters.jl has been using this prior to 0.3.

Implementing in other packages

I think the general indexing functionality for this package is there now, so that one could use this package as a dependency for NetCDF, HDF5 or Zarr without losing functionality. Here we can collect a list of the packages that might profit from DiskArrays and link to the respective PRs:

@visr @rafaqz @Alexander-Barth any suggestions and feedback would be highly appreciated

Implement chunked `foreach`/non allocating chunked iteration.

As pointed out in #15, broadcast can do pretty much everything. But one issue is that you have to call collect and allocate an array.

In GeoData.jl I'm using broadcast to iterate over arrays, e.g. to calculate ranges of non-nodata values for a trim method, so that the operation is chunked. But I still need to call collect on the broadcast object, which allocates.

It would be good if there was a way to iterate over the object with chunking that didn't allocate. Maybe that's possible already?
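For what it's worth, a minimal sketch of one pattern that should already work is looping over DiskArrays.eachchunk directly and processing one in-memory block at a time (a_disk stands for any chunked AbstractDiskArray, and the count is only a stand-in for the real per-element work):

using DiskArrays

function count_nonzero_chunked(a_disk)
  total = 0
  for c in DiskArrays.eachchunk(a_disk)
    block = a_disk[c...]            # one readblock! call per chunk
    total += count(!iszero, block)  # only the current chunk is held in memory
  end
  return total
end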

Finding an org for this package

As already discussed in #2 before registering this package I would like to decide about the org where the repo should be hosted. It was suggested that JuliaArrays would be a good fit, it is not obvious who to contact so I am pinging @mbauman who was involved in the initial discourse discussion, if he can comment if this package would meet the criteria to be included in JuliaArrays.

Use TiledIteration.jl?

For GridChunks, we could use TiledIteration.jl instead of implementing everything here. In general, I wonder if TiledIteration.jl couldn't become the interface module for all tiled array iteration (if it isn't already).

I have an implementation of iteration using TiledIteration.jl in HDF5Arrays.jl, so I could make a PR, if there's interest.

If we limit ourselves to tiles (GridChunks in DiskArrays.jl), we can also use tiled indexing to make a lot of the code implementing broadcasting, reduce operations, etc. cleaner. Further, the code could be moved to TiledIteration.jl, so non-DiskArrays can benefit from the tools here, too. See this proposal for tiled indexing

unlimited dimensions in NetCDF

As a follow-up to the discussion (Alexander-Barth/NCDatasets.jl#79), I would like to know if unlimited dimensions (i.e. growable) could be supported by DiskArrays. I hope it is OK for you that I make a separate issue here. It is easier for me to track what is already done and what still needs to be discussed.

The last proposition by Fabian has been:

Maybe it would help if I added a keyword argument checkbounds=false to the DiskArrays setindex implementation. So when you call setindex! in your DimensionallArrays code you can explicitly enforce bounds checking. In addition we can still discuss what the default behavior for any DiskArray should be...

I think this would be a good idea.

One could also annotate the type DiskArray with an additional parameter whether its size is fixed or not:

using Base: setindex!, size

struct FixedSize; end
struct MyArray{T,N,TFS} <: AbstractArray{T,N}; end

Base.size(A::MyArray) = (10,)

function fixedsize(::MyArray{T,N,TFS}) where {T,N,TFS}
    return TFS == FixedSize
end

function Base.setindex!(A::MyArray,value,i::Int)
    if fixedsize(A) && (i > 10)
        error("you can't do that: the array has a fixed size")
    end
    println("ok ",i)
end

# growable array
A = MyArray{Int,1,Nothing}()
A[5] = 1
A[15] = 1

# fixed-size array
Af = MyArray{Int,1,FixedSize}()
Af[5] = 1
Af[15] = 1 # error

It is decided during creation of a DiskArray whether the array can grow or not, and all subsequent calls can use A[index] = value rather than setindex!(A,value,index,checkbounds=true/false). The compiler should be able to specialize for fixed and growable arrays (there could even be two definitions of setindex! for the two cases).

CC @rafaqz, @Balinus, @visr

Changes to iteration over an `Unchunked` SubDiskArray?

Equality of an array and a view of a DiskArray-based Raster has broken with 0.3 when the array is Unchunked:

using Rasters
url = "https://download.osgeo.org/geotiff/samples/gdal_eg/cea.tif"
path = download(url)
A = Raster(path)
view(A, 1, 2:3, 1) == [0x00, 0x6b]

Gives the error:

ERROR: MethodError: Cannot `convert` an object of type Tuple{UnitRange{Int64}} to an object of type Int64
Closest candidates are:
  convert(::Type{T}, ::Base.TwicePrecision) where T<:Number at /opt/julia/share/julia/base/twiceprecision.jl:262
  convert(::Type{T}, ::AbstractChar) where T<:Number at /opt/julia/share/julia/base/char.jl:185
  convert(::Type{T}, ::CartesianIndex{1}) where T<:Number at /opt/julia/share/julia/base/multidimensional.jl:136
  ...

Stacktrace:
  [1] DiskArrays.RegularChunks(cs::Tuple{UnitRange{Int64}}, offset::Int64, s::Int64)
    @ DiskArrays ~/.julia/packages/DiskArrays/7xJDq/src/chunks.jl:17
  [2] (::DiskArrays.var"#34#37")(s::Int64, cs::Tuple{UnitRange{Int64}}, of::Int64)
    @ DiskArrays ~/.julia/packages/DiskArrays/7xJDq/src/chunks.jl:101
  [3] (::Base.var"#4#5"{DiskArrays.var"#34#37"})(a::Tuple{Int64, Tuple{UnitRange{Int64}}, Int64})
    @ Base ./generator.jl:36
  [4] iterate
    @ ./generator.jl:47 [inlined]
  [5] collect(itr::Base.Generator{Base.Iterators.Zip{Tuple{Tuple{Int64}, DiskArrays.GridChunks{1}, Tuple{Int64}}}, Base.var"#4#5"{DiskArrays.var"#34#37"}})
    @ Base ./array.jl:724
  [6] map
    @ ./abstractarray.jl:2948 [inlined]
  [7] DiskArrays.GridChunks(a::Tuple{Int64}, chunksize::DiskArrays.GridChunks{1}; offset::Tuple{Int64})
    @ DiskArrays ~/.julia/packages/DiskArrays/7xJDq/src/chunks.jl:100
  [8] #GridChunks#28
    @ ~/.julia/packages/DiskArrays/7xJDq/src/chunks.jl:98 [inlined]
  [9] GridChunks
    @ ~/.julia/packages/DiskArrays/7xJDq/src/chunks.jl:98 [inlined]
 [10] eachchunk_view(#unused#::DiskArrays.Unchunked, a::SubArray{UInt8, 1, FileArray{GDALfile, UInt8, 3, Nothing, DiskArrays.GridChunks{3}, DiskArrays.Unchunked}, Tuple{Int64, UnitRange{Int64}, Int64}, false})
    @ DiskArrays ~/.julia/packages/DiskArrays/7xJDq/src/subarrays.jl:35
 [11] eachchunk(a::DiskArrays.SubDiskArray{UInt8, 1})
    @ DiskArrays ~/.julia/packages/DiskArrays/7xJDq/src/subarrays.jl:26
 [12] iterate(a::DiskArrays.SubDiskArray{UInt8, 1})
    @ DiskArrays ~/.julia/packages/DiskArrays/7xJDq/src/iterator.jl:4
 [13] iterate
    @ ~/.julia/dev/DimensionalData/src/array/array.jl:66 [inlined]
 [14] _zip_iterate_some
    @ ./iterators.jl:358 [inlined]
 [15] _zip_iterate_all
    @ ./iterators.jl:350 [inlined]
 [16] iterate
    @ ./iterators.jl:340 [inlined]
 [17] ==(A::Raster{UInt8, 1, Tuple{Y{Projected{Float64, LinRange{Float64, Int64}, ReverseOrdered, Regular{Float64}, Intervals{Start}, Metadata{GDALfile, Dict{Any, Any}}, WellKnownText{GeoFormatTypes.CRS, String}, Nothing, Y{Colon}}}}, Tuple{X{Projected{Float64, LinRange{Float64, Int64}, ForwardOrdered, Regular{Float64}, Intervals{Start}, Metadata{GDALfile, Dict{Any, Any}}, WellKnownText{GeoFormatTypes.CRS, String}, Nothing, X{Colon}}}, Band{Categorical{Int64, UnitRange{Int64}, ForwardOrdered, NoMetadata}}}, DiskArrays.SubDiskArray{UInt8, 1}, Symbol, Metadata{GDALfile, Dict{Symbol, Any}}, Nothing}, B::Vector{UInt8})
    @ Base ./abstractarray.jl:2539
 [18] top-level scope
    @ REPL[125]:1

@meggart if you know what is going on with this stacktrace that would be helpful

common_chunks fails in broadcast with mixed ndims

Failing here for a broadcast between 2 and 3 dimensional arrays:

https://github.com/meggart/DiskArrays.jl/blob/master/src/ops.jl#L60

ERROR: BoundsError
Stacktrace:
  [1] getindex
    @ ./number.jl:96 [inlined]
  [2] #112
    @ ~/.julia/dev/DiskArrays/src/ops.jl:61 [inlined]
  [3] map(f::DiskArrays.var"#112#120", t::Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}, Int64})
    @ Base ./tuple.jl:215
  [4] common_chunks(::Tuple{Int64, Int64, Int64}, ::GeoData.FileArray{GeoData.GDALfile, Float32, 3, String, Tuple{Int64, Int64, Int64}, Nothing, Tuple{Int64, Int64, Int64}}, ::Vararg{Any, N} where N)
    @ DiskArrays ~/.julia/dev/DiskArrays/src/ops.jl:61
  [5] eachchunk(a::DiskArrays.BroadcastDiskArray{Bool, 3, Broadcasted{DiskArrays.ChunkStyle{3}, Nothing, typeof(==), Tuple{GeoData.FileArray{GeoData.GDALfile, Float32, 3, String, Tuple{Int64, Int64, Int64}, Nothing, Tuple{Int64, Int64, Int64}}, GeoData.FileArray{SMAPfile, Float32, 2, String, Tuple{Int64, Int64}, Symbol, Nothing}}}})
    @ DiskArrays ~/.julia/dev/DiskArrays/src/ops.jl:22
  [6] iterate(a::DiskArrays.BroadcastDiskArray{Bool, 3, Broadcasted{DiskArrays.ChunkStyle{3}, Nothing, typeof(==), Tuple{GeoData.FileArray{GeoData.GDALfile, Float32, 3, String, Tuple{Int64, Int64, Int64}, Nothing, Tuple{Int64, Int64, Int64}}, GeoData.FileArray{SMAPfile, Float32, 2, String, Tuple{Int64, Int64}, Symbol, Nothing}}}})
    @ DiskArrays ~/.julia/dev/DiskArrays/src/iterator.jl:4
  [7] _all

Because at that point:

tt = ((3856, 0), (1624, 0), 1)

I'm not quite sure what the solution is.

lazy reverse(a; dims=n) with keyword dimension

It would be good to dispatch on the DiskArray implementation for the lazy reverse when one specifies the dimension as a keyword argument.
It dispatches correctly if the dimension is the second positional argument, but the keyword-argument form is dispatched to the default Base implementation.

a:: DiskArray

julia> reverse(a, dims=1) # Non lazy
10000×10000 Matrix{UInt8}:
 0x00  0x00    0x00  0x00
 0x00  0x00     0x00  0x00
 0x00  0x00     0x00  0x00
 0x00  0x00     0x00  0x00
 0x00  0x00     0x00  0x00
 0x00  0x00    0x00  0x00
 0x00  0x00     0x00  0x00
 0x00  0x00     0x00  0x00
 0x00  0x00     0x00  0x00
                    
 0x00  0x00     0x00  0x00
 0x00  0x00     0x00  0x00
 0x00  0x00     0x00  0x00
 0x00  0x00    0x00  0x00
 0x00  0x00     0x00  0x00
 0x00  0x00     0x00  0x00
 0x00  0x00     0x00  0x00
 0x00  0x00     0x00  0x00

julia> reverse(a, 1) # Lazy 
Disk Array with size 10000 x 10000

fix generators

We could define a DiskGenerator object with a method for collect that properly uses eachindex to populate the in-memory array in the right order.

Backend generalisation

@meggart instead of using Flatten.jl we could have a mechanism for propagating the engine trait between DiskArray layers? It might be more robust.

So they can all just forward to the parent disk object, but e.g. BroadcastDiskArray might need to store the trait somewhere.

Fix zip and other cases of simultaneous iteration

zip is used in array equality, and breaks chunking order when arrays with different chunks are used.

We should at minimum fix it so that the chunks of the first disk array found are used for all others (and for regular arrays). This can be achieved with a RechunkDiskArray wrapper that simply forces a different chunk pattern. It may be slow in some cases but will at least be correct.

Later we could optimise the combined iteration pattern of all diskarrays passed to zip.

Batch getindex is not typesafe

Batchindex with batches is not typesafe at the moment.

julia> @code_warntype c.data[[(1, 1), (1, 2)], :, :]
MethodInstance for getindex(::DiskArrayTools.DiskArrayStack{Union{Missing, Float32}, 4, DiskArrays.SubDiskArray{Union{Missing, Float32}, 3}, 1}, ::Vector{Tuple{Int64, Int64}}, ::Colon, ::Colon)
  from getindex(a::DiskArrays.AbstractDiskArray, i...) in DiskArrays at /home/gkraemer/.julia/packages/DiskArrays/bcti2/src/diskarray.jl:174
Arguments
  #self#::Core.Const(getindex)
  a::DiskArrayTools.DiskArrayStack{Union{Missing, Float32}, 4, DiskArrays.SubDiskArray{Union{Missing, Float32}, 3}, 1}
  i::Tuple{Vector{Tuple{Int64, Int64}}, Colon, Colon}
Body::Any
1 ─ %1 = Core.tuple(a)::Tuple{DiskArrayTools.DiskArrayStack{Union{Missing, Float32}, 4, DiskArrays.SubDiskArray{Union{Missing, Float32}, 3}, 1}}
│   %2 = Core._apply_iterate(Base.iterate, DiskArrays.getindex_disk, %1, i)::Any
└──      return %2

Support permutedims and reshape

Currently both of these work on DiskArrays because of the AbstractArray subtype, but they do not get the AbstractDiskArray behavior. So one would have to define getindex, setindex! and eachchunk in a similar way as is done in subarrays.jl for ReshapedArray{<:Any,<:Any,<:AbstractDiskArray} and PermutedDimsArray{<:Any,<:Any,<:AbstractDiskArray}.

Iteration is broken with chunking

Iteration seems to be broken with chunking, so when fallbacks are used by base methods you just get the wrong answer, without an error. So e.g. Array(diskarray) currently uses iteration, and can give the wrong result:

julia>   a = collect(reshape(1:90,10,9))
10×9 Matrix{Int64}:
  1  11  21  31  41  51  61  71  81
  2  12  22  32  42  52  62  72  82
  3  13  23  33  43  53  63  73  83
  4  14  24  34  44  54  64  74  84
  5  15  25  35  45  55  65  75  85
  6  16  26  36  46  56  66  76  86
  7  17  27  37  47  57  67  77  87
  8  18  28  38  48  58  68  78  88
  9  19  29  39  49  59  69  79  89
 10  20  30  40  50  60  70  80  90

julia>   a_disk = _DiskArray(a, chunksize=(5,3))
Disk Array with size 10 x 9

julia>   Array(a_disk)
10×9 Matrix{Int64}:
  1  21  16  31  51  46  61  81  76
  2  22  17  32  52  47  62  82  77
  3  23  18  33  53  48  63  83  78
  4  24  19  34  54  49  64  84  79
  5  25  20  35  55  50  65  85  80
 11   6  26  41  36  56  71  66  86
 12   7  27  42  37  57  72  67  87
 13   8  28  43  38  58  73  68  88
 14   9  29  44  39  59  74  69  89
 15  10  30  45  40  60  75  70  90

#37 may be a partial solution to this. But it may also be good to use other methods as much as possible so the iteration fallback isn't used.

julia 1.9/nightly error: type Stateful has no field taken

ArchGDAL is currently failing tests on 1.9/nightly with "type Stateful has no field taken".
That seems to be due to the .taken here:

cistateold = chunkiter.taken
biter = iterate(bi, bstate)
if biter === nothing
return nothing
end
innernow, bstatenew = biter
(chunkiter, innerinds) = bstatenew
if chunkiter.taken !== cistateold

This field was removed in JuliaLang/julia#45924.

This is the full error message from ArchGDAL.

Complex IO: Error During Test at /home/runner/work/ArchGDAL.jl/ArchGDAL.jl/test/test_rasterio.jl:294
  Test threw exception
  Expression: AG.getband(ds_copy, 1) == a
  type Stateful has no field taken
  Stacktrace:
   [1] getproperty
     @ ./Base.jl:37 [inlined]
   [2] iterate(a::ArchGDAL.IRasterBand{ComplexF32}, i::Tuple{OffsetArrays.OffsetMatrix{ComplexF32, Matrix{ComplexF32}}, DiskArrays.BlockedIndices{DiskArrays.GridChunks{2}}, Tuple{Base.Iterators.Stateful{DiskArrays.GridChunks{2}, Union{Nothing, Tuple{Any, Tuple{Any, Vararg{Any}}}}, Int64}, Base.Iterators.Stateful{CartesianIndices{2, Tuple{UnitRange{Int64}, UnitRange{Int64}}}, Union{Nothing, Tuple{CartesianIndex{2}, CartesianIndex{2}}}, Int64}}})
     @ DiskArrays ~/.julia/packages/DiskArrays/VGEpq/src/iterator.jl:53
   [3] _zip_iterate_some
     @ ./iterators.jl:424 [inlined]
   [4] _zip_iterate_all
     @ ./iterators.jl:416 [inlined]
   [5] iterate
     @ ./iterators.jl:407 [inlined]
   [6] ==(A::ArchGDAL.IRasterBand{ComplexF32}, B::Matrix{ComplexF32})
     @ Base ./abstractarray.jl:2919
   [7] eval_test(evaluated::Expr, quoted::Expr, source::LineNumberNode, negate::Bool)
     @ Test /opt/hostedtoolcache/julia/nightly/x64/share/julia/stdlib/v1.10/Test/src/Test.jl:320
   [8] macro expansion
     @ /opt/hostedtoolcache/julia/nightly/x64/share/julia/stdlib/v1.10/Test/src/Test.jl:477 [inlined]
   [9] (::var"#391#393"{Matrix{ComplexF32}})(ds_copy::ArchGDAL.Dataset)
     @ Main ~/work/ArchGDAL.jl/ArchGDAL.jl/test/test_rasterio.jl:294

It's not directly clear to me what the right fix is. Is that calling length() on the iterator?

Linear indexing?

Linear indexing doesn't seem to work with an AbstractDiskArray. Is that intentional or do I need to add something to the interface? Here FileArray <: AbstractDiskArray

julia> A[1]
ERROR: BoundsError: attempt to access 267×178×1 GeoData.FileArray{GeoData.GDALfile, Int32, 3, String, Tuple{Int64, Int64, Int64}, Nothing, Tuple{Int64, Int64, Int64}} at index [1]
Stacktrace:
 [1] interpret_indices_disk
   @ ~/.julia/packages/DiskArrays/QATDH/src/DiskArrays.jl:118 [inlined]
 [2] getindex_disk(a::GeoData.FileArray{GeoData.GDALfile, Int32, 3, String, Tuple{Int64, Int64, Int64}, Nothing, Tuple{Int64, Int64, Int64}}, i::Int64)
   @ DiskArrays ~/.julia/packages/DiskArrays/QATDH/src/DiskArrays.jl:58
 [3] getindex
   @ ~/.julia/packages/DiskArrays/QATDH/src/DiskArrays.jl:186 [inlined]
 [4] getindex(A::GeoArray{Int32, 3, Tuple{X{LinRange{Float64}, Projected{Ordered{ForwardIndex, ForwardArray, ForwardRelation}, Regular{Float64}, Intervals{Start}, WellKnownText{GeoFormatTypes.CRS, String}, Nothing}, Metadata{GeoData.GDALfile, Dict{Any, Any}}}, Y{LinRange{Float64}, Projected{Ordered{ReverseIndex, ReverseArray, ForwardRelation}, Regular{Float64}, Intervals{Start}, WellKnownText{GeoFormatTypes.CRS, String}, Nothing}, Metadata{GeoData.GDALfile, Dict{Any, Any}}}, Band{UnitRange{Int64}, Categorical{Ordered{ForwardIndex, ForwardArray, ForwardRelation}}, NoMetadata}}, Tuple{}, GeoData.FileArray{GeoData.GDALfile, Int32, 3, String, Tuple{Int64, Int64, Int64}, Nothing, Tuple{Int64, Int64, Int64}}, Symbol, Metadata{GeoData.GDALfile, Dict{Symbol, Any}}, Float64}, i::Int64)
   @ DimensionalData ~/.julia/dev/DimensionalData/src/indexing.jl:22
 [5] top-level scope
   @ REPL[56]:1

Don't load whole chunk when you don't need to, e.g. with NetCDF

We can likely get around this problem by skipping batchgetindex for small indexing operations. We may also need to facilitate easy backend customisation of this behaviour.

Originally posted by @Alexander-Barth in #131 (comment)

In the worst-case scenario (reading a time series for spatial chunked data), one would read a single value from each chunk:

Test with NCDatasets 0.12.17:

using BenchmarkTools                                                                                                                                                  
using NCDatasets                                                                                                                                                      
                                                                                                                                                                      
sz = (1024,1024,64)                                                                                                                                                   
ds = NCDataset("/tmp/tmp.nc","c")                                                                                                                                     
defDim(ds,"lon",sz[1])                                                                                                                                                
defDim(ds,"lat",sz[2])                                                                                                                                                
defDim(ds,"time",sz[3])                                                                                                                                               
                                                                                                                                                                      
defVar(ds,"data",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1])                                                                                               
defVar(ds,"datac",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1],deflatelevel = 9)                                                                            
                                                                                                                                                                      
ds["data"][:] = randn(sz);                                                                                                                                            
ds["datac"][:] = randn(sz);                                                                                                                                           
                                                                                                                                                                      
@btime ds["data"][1,1,1];                                                                                                                                             
@btime ds["data"][:,:,1];                                                                                                                                             
                                                                                                                                                                      
@btime ds["datac"][1,1,1];                                                                                                                                            
@btime ds["datac"][:,:,1];

output:

julia> @btime ds["data"][1,1,1];                                                                                                                                      
  15.069 μs (34 allocations: 1.62 KiB)                                                                                                                                
                                                                                                                                                                      
julia> @btime ds["data"][:,:,1];                                                                                                                                      
  1.014 ms (35 allocations: 8.00 MiB)                                                                                                                                 
                                                                                                                                                                      
julia> @btime ds["datac"][1,1,1];                                                                                                                                     
  13.518 μs (34 allocations: 1.62 KiB)                                                                                                                                
                                                                                                                                                                      
julia> @btime ds["datac"][:,:,1];                                                                                                                                     
  470.025 μs (35 allocations: 8.00 MiB)                                                                                                                               

So reading an entire chunk (if you do not need it) is 67 times slower for uncompressed data and 36 times slower for compressed data. But we read the same data multiple times with @btime.
The netCDF library is likely caching the data. Some applications might indeed read the same data multiple times, but this is less typical. If the data is read for the first time, the differences are not as large:

sz = (1024,1024,64)                                                                                                                                                   
isfile("/tmp/tmp3.nc") && rm("/tmp/tmp3.nc")                                                                                                                          
ds = NCDataset("/tmp/tmp3.nc","c")                                                                                                                                    
defDim(ds,"lon",sz[1])                                                                                                                                                
defDim(ds,"lat",sz[2])                                                                                                                                                
defDim(ds,"time",sz[3])                                                                                                                                               
                                                                                                                                                                      
defVar(ds,"data",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1])                                                                                              
defVar(ds,"datac",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1],deflatelevel = 9)                                                                            
                                                                                                                                                                      
ds["data"][:] = randn(sz);                                                                                                                                            
ds["datac"][:] = randn(sz);                                                                                                                                           
close(ds)                                                                                                                                                             
                                                                                                                                                                      
ds = NCDataset("/tmp/tmp3.nc")                                                                                                                                        
# warm-up                                                                                                                                                             
ds["data"][1,1,1];                                                                                                                                                    
ds["data"][:,:,2];                                                                                                                                                    
                                                                                                                                                                      
@time ds["data"][1,1,3];                                                                                                                                              
@time ds["data"][:,:,4];                                                                                                                                              
                                                                                                                                                                      
@time ds["datac"][1,1,3];                                                                                                                                             
@time ds["datac"][:,:,4];           

output

  0.001231 seconds (34 allocations: 1.625 KiB)                                                                                                                        
  0.001682 seconds (35 allocations: 8.002 MiB)                                                                                                                        
  0.031734 seconds (34 allocations: 1.625 KiB)                                                                                                                        
  0.038516 seconds (35 allocations: 8.002 MiB)                                                                                                                        

So a difference of about 20 % in this case.

For:

@time ds["data"][512,512,3];
@time ds["data"][:,:,4];

@time ds["datac"][512,512,3];
@time ds["datac"][:,:,4];

I get (loading the first time)

  0.026237 seconds (36 allocations: 1.656 KiB)
  0.045077 seconds (35 allocations: 8.002 MiB)
  0.033074 seconds (36 allocations: 1.656 KiB)
  0.038558 seconds (35 allocations: 8.002 MiB)

If the data is in the libnetcdf chunk cache (using @btime), I get:

julia> @btime ds["data"][512,512,3];
  15.549 μs (36 allocations: 1.66 KiB)

julia> @btime ds["data"][:,:,4];
  421.606 μs (35 allocations: 8.00 MiB)

julia> @btime ds["datac"][512,512,3];
  15.361 μs (36 allocations: 1.66 KiB)

julia> @btime ds["datac"][:,:,4];
  417.527 μs (35 allocations: 8.00 MiB)

Feel free to make other benchmarks with other indices.
Another use case is loading data over the network via OPeNDAP or HTTP byte-range requests, where you certainly do not want to transfer more data than necessary.

Handle concatenation with `cat`

Currently cat is very slow, and may even give the wrong answer?

See: rafaqz/Rasters.jl#298

It could be more efficient to allocate the final array and then broadcast into views of it, instead of copying each disk array to memory separately.
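A minimal sketch of that idea for the two-matrix, dims=1 case (illustrative only, not an existing DiskArrays method; it assumes both inputs have the same number of columns):

function cat1_via_views(a, b)
  out = Array{promote_type(eltype(a), eltype(b))}(undef, size(a, 1) + size(b, 1), size(a, 2))
  view(out, 1:size(a, 1), :) .= a                     # written block-wise via broadcast
  view(out, size(a, 1)+1:size(out, 1), :) .= b
  return out
end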

1D variable with unlimited dimensions

While debugging an issue in NCDatasets using DiskArrays v0.3.20 (Alexander-Barth/NCDatasets.jl#230)
I found some behavior of NetCDF.jl with DiskArrays and variables with unlimited dimensions that I could not understand.
While the following code with a 2D array (one dimension unlimited) works without error:

import NetCDF
fname2 = tempname()
time_dim = NetCDF.NcDim("time",Int[],unlimited=true);
lon_dim = NetCDF.NcDim("lon",1:3);
v = NetCDF.NcVar("obs",[lon_dim,time_dim]);
NetCDF.create(fname2,v) do nc
   nc["obs"][:,1:3] = rand(3,3)
end;

The following code with a single unlimited dimension fails:

import NetCDF
var = collect(1:4)
fname2 = tempname()
time_dim = NetCDF.NcDim("time",Int[],unlimited=true);
v = NetCDF.NcVar("var",[time_dim]);
NetCDF.create(fname2,v) do nc
    nc["var"][1:4] = var
end;

with the following error message:

ERROR: BoundsError: attempt to access 0-element CartesianIndices{1, Tuple{Base.OneTo{Int64}}} at index [1:4]                                                       
Stacktrace:
  [1] throw_boundserror(A::CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, I::Tuple{UnitRange{Int64}})                                                                      
    @ Base ./abstractarray.jl:744
  [2] checkbounds
    @ ./abstractarray.jl:709 [inlined]
  [3] getindex
    @ ./multidimensional.jl:358 [inlined]
  [4] view
    @ ./multidimensional.jl:372 [inlined]
  [5] interpret_indices_disk
    @ ~/.julia/packages/DiskArrays/VLHvZ/src/diskarray.jl:137 [inlined]
  [6] setindex_disk!(a::NcVar{Float64, 1, 6}, v::UnitRange{Int64}, i::UnitRange{Int64})
    @ DiskArrays ~/.julia/packages/DiskArrays/VLHvZ/src/diskarray.jl:56
  [7] setindex!(a::NcVar{Float64, 1, 6}, v::UnitRange{Int64}, i::UnitRange{Int64})
    @ DiskArrays ~/.julia/packages/DiskArrays/VLHvZ/src/diskarray.jl:222
  [8] (::var"#15#16")(nc::NcFile)
    @ Main ./REPL[54]:2
  [9] create(::var"#15#16", ::String, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})                                     
    @ NetCDF ~/.julia/packages/NetCDF/7hOe9/src/NetCDF.jl:1022
 [10] create(::Function, ::String, ::NcVar{Float64, 1, 6})
    @ NetCDF ~/.julia/packages/NetCDF/7hOe9/src/NetCDF.jl:1019
 [11] top-level scope
    @ REPL[54]:1

NetCDF.jl and the equivalent NCDatasets.jl code both fail at the same line:

https://github.com/meggart/DiskArrays.jl/blob/v0.3.20/src/diskarray.jl#L137

Ref:
#53

Implement PermutedDimsArray for DiskArrays

Hi, I noticed bad performance when using PermutedDimsArray with ArchGDAL, and the issue is probably located here. See yeesian/ArchGDAL.jl#225, which explains the problem. However, I don't see an easy fix like the one proposed over there, as it runs into a stack overflow for AbstractDiskArrays.

Wrong behavior of default Array constructor

I do not have a reproducible example at hand, but there is a problem with using the default Array constructor on chunked AbstractDiskArrays. Adding the following method fixed the code:

function Array(a::AbstractDiskArray)
    aout = similar(a)
    aout .= a
    aout
end

but it is worth investigating and adding some unit tests when fixing this.

Index error when using mapCube and Distributed

Here is a minimal working example:

using Pkg

Pkg.activate("/Net/Groups/BGI/people/dpabon/nfdi4earth_oemc")

using Distributed
println(myid()," ", DEPOT_PATH)

addprocs(10)
@everywhere begin
using Pkg

Pkg.activate("/Net/Groups/BGI/people/dpabon/nfdi4earth_oemc")
 
using YAXArrays
using Zarr
function simple_function(out1, cube_in1, cube_in2)
    #=
    out1 .= 1
    =#
end
end

lcc_final_cube = open_dataset("/Net/Groups/BGI/scratch/dpabon/lcc_final_cube_2002_2011_8d-0.083deg.zarr")
lcc_final_cube = Cube(lcc_final_cube)


lcc_classes = ["Evergreen_Broadleaf_Forests", 
    "Deciduous_Broadleaf_Forests",
    "Mixed_Forest",
    "Savannas", 
    "Grasslands",
    "Croplands"]


lcc_small = lcc_final_cube[lon =(-78, -68), lat = (-5, 15), time = 2002:2003, classes = lcc_classes]



lst_small = open_dataset("/Net/Groups/BGI/scratch/dpabon/lst_small.zarr")
lst_small = Cube(lst_small)

renameaxis!(lcc_small, "lon" => getAxis("lon", lst_small))
renameaxis!(lcc_small, "lat" => getAxis("lat", lst_small))

 

indims_cube_1 = InDims("time")

indims_cube_2 = InDims("time", "classes")

out_1_dims = OutDims("time", CategoricalAxis("summary_stat", ["rsquared"]))
    

mapCube(simple_function, (lst_small, lcc_small), indims = (indims_cube_1, indims_cube_2), outdims = out_1_dims)

produces:

ERROR: On worker 2:
BoundsError: attempt to access 20×6×120×45 Array{Float32, 4} at index [1:20, 1, 1225:1296, 901:945]
Stacktrace:
  [1] throw_boundserror
    @ ./abstractarray.jl:703
  [2] checkbounds
    @ ./abstractarray.jl:668 [inlined]
  [3] _setindex!
    @ ./multidimensional.jl:929 [inlined]
  [4] setindex!
    @ ./abstractarray.jl:1344 [inlined]
  [5] filldata!
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/batchgetindex.jl:127
  [6] #disk_getindex_batch!#38
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/batchgetindex.jl:113
  [7] disk_getindex_batch
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/batchgetindex.jl:121
  [8] batchgetindex
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/batchgetindex.jl:72
  [9] getindex_disk
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/diskarray.jl:28
 [10] getindex
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/diskarray.jl:177
 [11] readblock!
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrayTools/WsUY6/src/DiskArrayTools.jl:250
 [12] readblock!
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/subarray.jl:25
 [13] getindex_disk
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/diskarray.jl:31
 [14] getindex
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/diskarray.jl:177
 [15] updatear
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/Fe7F8/src/DAT/DAT.jl:639
 [16] #78
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/Fe7F8/src/DAT/DAT.jl:571
 [17] #59
    @ ./tuple.jl:556 [inlined]
 [18] BottomRF
    @ ./reduce.jl:81 [inlined]
 [19] _foldl_impl
    @ ./reduce.jl:62
 [20] foldl_impl
    @ ./reduce.jl:48 [inlined]
 [21] mapfoldl_impl
    @ ./reduce.jl:44 [inlined]
 [22] #mapfoldl#259
    @ ./reduce.jl:170 [inlined]
 [23] #foldl#260
    @ ./reduce.jl:193 [inlined]
 [24] foreach
    @ ./tuple.jl:556 [inlined]
 [25] updatears
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/Fe7F8/src/DAT/DAT.jl:568
 [26] updateinars
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/Fe7F8/src/DAT/DAT.jl:656
 [27] #107
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/Fe7F8/src/DAT/DAT.jl:700
 [28] fnew
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/Fe7F8/src/DAT/DAT.jl:665
 [29] #56
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1016
 [30] #invokelatest#2
    @ ./essentials.jl:729
 [31] invokelatest
    @ ./essentials.jl:726
 [32] #110
    @ /Net/Groups/BGI/people/dpabon/bin/julia-1.8.5-linux-x86_64/julia-1.8.5/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:285
 [33] run_work_thunk
    @ /Net/Groups/BGI/people/dpabon/bin/julia-1.8.5-linux-x86_64/julia-1.8.5/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:70
 [34] macro expansion
    @ /Net/Groups/BGI/people/dpabon/bin/julia-1.8.5-linux-x86_64/julia-1.8.5/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:285 [inlined]
 [35] #109
    @ ./task.jl:484
Stacktrace:
  [1] (::Base.var"#939#941")(x::Task)
    @ Base ./asyncmap.jl:177
  [2] foreach(f::Base.var"#939#941", itr::Vector{Any})
    @ Base ./abstractarray.jl:2774
  [3] maptwice(wrapped_f::Function, chnl::Channel{Any}, worker_tasks::Vector{Any}, c::DiskArrays.GridChunks{2})
    @ Base ./asyncmap.jl:177
  [4] wrap_n_exec_twice
    @ ./asyncmap.jl:153 [inlined]
  [5] #async_usemap#924
    @ ./asyncmap.jl:103 [inlined]
  [6] #asyncmap#923
    @ ./asyncmap.jl:81 [inlined]
  [7] pmap(f::Function, p::WorkerPool, c::DiskArrays.GridChunks{2}; distributed::Bool, batch_size::Int64, on_error::Nothing, retry_delays::Vector{Any}, retry_check::Nothing)
    @ Distributed /Net/Groups/BGI/people/dpabon/bin/julia-1.8.5-linux-x86_64/julia-1.8.5/share/julia/stdlib/v1.8/Distributed/src/pmap.jl:126
  [8] pmap(f::Function, p::WorkerPool, c::DiskArrays.GridChunks{2})
    @ Distributed /Net/Groups/BGI/people/dpabon/bin/julia-1.8.5-linux-x86_64/julia-1.8.5/share/julia/stdlib/v1.8/Distributed/src/pmap.jl:99
  [9] macro expansion
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1015 [inlined]
 [10] macro expansion
    @ ./task.jl:454 [inlined]
 [11] macro expansion
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1014 [inlined]
 [12] macro expansion
    @ ./task.jl:454 [inlined]
 [13] progress_map(::Function, ::Vararg{Any}; mapfun::typeof(pmap), progress::ProgressMeter.Progress, channel_bufflen::Int64, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ ProgressMeter /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1007
 [14] #progress_pmap#60
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1032 [inlined]
 [15] pmap_with_data(f::Function, p::WorkerPool, c::DiskArrays.GridChunks{2}; initfunc::Function, progress::ProgressMeter.Progress, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ YAXArrays.DAT /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/au5n4/src/DAT/DAT.jl:668
 [16] pmap_with_data(f::Function, c::DiskArrays.GridChunks{2}; initfunc::Function, kwargs::Base.Pairs{Symbol, ProgressMeter.Progress, Tuple{Symbol}, NamedTuple{(:progress,), Tuple{ProgressMeter.Progress}}})
    @ YAXArrays.DAT /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/au5n4/src/DAT/DAT.jl:673
 [17] runLoop(dc::YAXArrays.DAT.DATConfig{2, 1}, showprog::Bool)
    @ YAXArrays.DAT /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/au5n4/src/DAT/DAT.jl:698
 [18] mapCube(::typeof(simple_function), ::Tuple{YAXArray{Union{Missing, Float32}, 3, DiskArrayTools.CFDiskArray{Float32, 3, Float32, DiskArrays.SubDiskArray{Float32, 3}}, Vector{RangeAxis}}, YAXArray{Union{Missing, Float32}, 4, DiskArrays.SubDiskArray{Union{Missing, Float32}, 4}, Vector{CubeAxis}}}; max_cache::Float64, indims::Tuple{InDims, InDims}, outdims::OutDims, inplace::Bool, ispar::Bool, debug::Bool, include_loopvars::Bool, showprog::Bool, irregular_loopranges::Bool, nthreads::Dict{Int64, Int64}, loopchunksize::Dict{Any, Any}, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ YAXArrays.DAT /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/au5n4/src/DAT/DAT.jl:475
 [19] top-level scope
    @ ~/dpabon/nfdi4earth_oemc/bin/julia/exploratory_analysis/minimal_example_batch_issue.jl:50

Add Initialisation, Base.resize!, Base.append! and concatenation to interface

I had some ideas for possible additions:

It might be cool for something like arr = SomeDiskArray{Float64}(file_handle, undef, 100, 10) to be part of the interface. This would make it possible to treat a DiskArray almost exactly like an Array. Functions like map would "just" work. Maybe even go so far as to allow arr = SomeDiskArray{Float64}(undef, 100, 10), where the file object is automatically created in the current working directory with some temp name.

If a DiskArray implements the resize! method, an automatic append! can be defined for any DiskArray, as sketched below.
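
A minimal sketch of that fallback (none of this is existing DiskArrays.jl API; it assumes the backend provides a resize! method that grows a 1-d array on disk):

using DiskArrays

function Base.append!(a::DiskArrays.AbstractDiskArray{T,1}, values::AbstractVector) where {T}
    oldlen = length(a)
    resize!(a, oldlen + length(values))   # assumed to be implemented by the backend
    a[oldlen+1:end] = values              # routed through setindex!/writeblock!
    return a
end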

For concatenation, we can define a cat!(out, A..., dims=dims) function. If out can be resized, it can be of any size; if it cannot, the size of out must be the sum of the sizes of all A. Maybe use a trait to specify whether or not a DiskArray can be resized. If we have initialisation defined, we can even have cat work like it does for normal arrays: it would just initialise a new array and write into it.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

About this package

Hi @rafaqz,

As discussed on Discourse, you showed some interest in defining something like an AbstractDiskArray. I am not sure if you have started something like this already, but since I am trying to adapt the indexing behavior of NetCDF.jl and Zarr.jl and want to avoid duplicate work, I started this package, which might be merged with your efforts. The basic idea is that all a user has to provide is a function that reads a cartesian subrange from an array on disk, and this package takes care of the remaining index details, translating trailing and missing singleton dimensions etc.

Currently this supports getindex only, but it should give you an idea of what such a package could look like. Your opinion would be highly appreciated.

Next I would continue to remove the getindex implementations from Zarr and NetCDF and instead use this interface here, so that any indexing improvement would help both packages.

Fix method ambiguities

(@v1.8) pkg> st
Status `~/.julia/environments/v1.8/Project.toml`
  [4c88cf16] Aqua v0.5.5
  [3c3547ce] DiskArrays v0.3.6

julia> using DiskArrays, Aqua

julia> Aqua.test_all(DiskArrays)
Skipping Base.active_repl
Skipping Base.active_repl_backend
Skipping Base.Filesystem.JL_O_RANDOM
Skipping Base.Filesystem.JL_O_SEQUENTIAL
Skipping Base.Filesystem.JL_O_SHORT_LIVED
Skipping Base.Filesystem.JL_O_TEMPORARY
13 ambiguities found
Ambiguity #1
copyto!(dest::AbstractMatrix, Rdest::CartesianIndices{2, R} where R<:Tuple{OrdinalRange{Int64, Int64}, OrdinalRange{Int64, Int64}}, src::SparseArrays.AbstractSparseMatrixCSC{T}, Rsrc::CartesianIndices{2, R} where R<:Tuple{OrdinalRange{Int64, Int64}, OrdinalRange{Int64, Int64}}) where T in SparseArrays at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/SparseArrays/src/sparsematrix.jl:481
copyto!(dest::DiskArrays.AbstractDiskArray, Rdest::CartesianIndices, src::AbstractArray, Rsrc::CartesianIndices) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:8

Possible fix, define
  copyto!(::DiskArrays.AbstractDiskArray{T, 2} where T, ::CartesianIndices{2, R} where R<:Tuple{OrdinalRange{Int64, Int64}, OrdinalRange{Int64, Int64}}, ::SparseArrays.AbstractSparseMatrixCSC{T}, ::CartesianIndices{2, R} where R<:Tuple{OrdinalRange{Int64, Int64}, OrdinalRange{Int64, Int64}}) where T

Ambiguity #2
reverse(a::DiskArrays.AbstractDiskArray, dims) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:23
reverse(A::AbstractVector, start::Integer) in Base at array.jl:1803

Possible fix, define
  reverse(::DiskArrays.AbstractDiskArray{T, 1} where T, ::Integer)

Ambiguity #3
copyto!(dest::AbstractArray, D::SuiteSparse.CHOLMOD.Dense) in SuiteSparse.CHOLMOD at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/SuiteSparse/src/cholmod.jl:830
copyto!(dest::DiskArrays.AbstractDiskArray, source::AbstractArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:5

Possible fix, define
  copyto!(::DiskArrays.AbstractDiskArray, ::SuiteSparse.CHOLMOD.Dense)

Ambiguity #4
copyto!(A::AbstractMatrix, B::SparseArrays.AbstractSparseMatrixCSC) in SparseArrays at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/SparseArrays/src/sparsematrix.jl:458
copyto!(dest::DiskArrays.AbstractDiskArray, source::AbstractArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:5

Possible fix, define
  copyto!(::DiskArrays.AbstractDiskArray{T, 2} where T, ::SparseArrays.AbstractSparseMatrixCSC)

Ambiguity #5
copyto!(dest::AbstractMatrix, src::LinearAlgebra.AbstractQ) in LinearAlgebra at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/LinearAlgebra/src/qr.jl:599
copyto!(dest::DiskArrays.AbstractDiskArray, source::AbstractArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:5

Possible fix, define
  copyto!(::DiskArrays.AbstractDiskArray{T, 2} where T, ::LinearAlgebra.AbstractQ)

Ambiguity #6
commonlength(::Tuple{}, b) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/diskarray.jl:154
commonlength(a, ::Tuple{}) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/diskarray.jl:155

Possible fix, define
  commonlength(::Tuple{}, ::Tuple{})

Ambiguity #7
_commonlength(a1, b1, ::Tuple{}, b) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/diskarray.jl:157
_commonlength(a1, b1, a, ::Tuple{}) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/diskarray.jl:158

Possible fix, define
  _commonlength(::Any, ::Any, ::Tuple{}, ::Tuple{})

Ambiguity #8
copyto!(dest::AbstractArray, source::DiskArrays.AbstractDiskArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:6
copyto!(dest::PermutedDimsArray, src::AbstractArray) in Base.PermutedDimsArrays at permuteddimsarray.jl:215

Possible fix, define
  copyto!(::PermutedDimsArray, ::DiskArrays.AbstractDiskArray)

Ambiguity #9
copyto!(dest::AbstractArray{T}, D::SuiteSparse.CHOLMOD.Dense{T}) where T<:Union{Float64, ComplexF64} in SuiteSparse.CHOLMOD at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/SuiteSparse/src/cholmod.jl:828
copyto!(dest::DiskArrays.AbstractDiskArray, source::AbstractArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:5

Possible fix, define
  copyto!(::DiskArrays.AbstractDiskArray{T}, ::SuiteSparse.CHOLMOD.Dense{T}) where T<:Union{Float64, ComplexF64}

Ambiguity #10
copyto!(A::SparseArrays.SparseVector, B::AbstractVector) in SparseArrays at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/SparseArrays/src/sparsevector.jl:501
copyto!(dest::AbstractArray, source::DiskArrays.AbstractDiskArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:6

Possible fix, define
  copyto!(::SparseArrays.SparseVector, ::DiskArrays.AbstractDiskArray{T, 1} where T)

Ambiguity #11
copyto!(dest::AbstractArray, source::DiskArrays.AbstractDiskArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:6
copyto!(dest::PermutedDimsArray{T, N}, src::AbstractArray{T, N}) where {T, N} in Base.PermutedDimsArrays at permuteddimsarray.jl:211

Possible fix, define
  copyto!(::PermutedDimsArray{T, N}, ::DiskArrays.AbstractDiskArray{T, N}) where {T, N}

Ambiguity #12
_reshape(parent::DiskArrays.AbstractDiskArray, dims::Tuple{Vararg{Int64, var"#164#N"}}) where var"#164#N" in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/reshape.jl:79
_reshape(v::AbstractVector, dims::Tuple{Int64}) in Base at reshapedarray.jl:168

Possible fix, define
  _reshape(::DiskArrays.AbstractDiskArray{T, 1} where T, ::Tuple{Int64})

Ambiguity #13
copyto!(dest::AbstractMatrix{T}, D::SuiteSparse.CHOLMOD.Dense{T}) where T<:Union{Float64, ComplexF64} in SuiteSparse.CHOLMOD at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/SuiteSparse/src/cholmod.jl:829
copyto!(dest::DiskArrays.AbstractDiskArray, source::AbstractArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:5

Possible fix, define
  copyto!(::DiskArrays.AbstractDiskArray{T, 2}, ::SuiteSparse.CHOLMOD.Dense{T}) where T<:Union{Float64, ComplexF64}

Method ambiguity: Test Failed at /Users/yeesian/.julia/packages/Aqua/HWLbM/src/ambiguities.jl:117
  Expression: success(pipeline(cmd; stdout = stdout, stderr = stderr))
Stacktrace:
 [1] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:464 [inlined]
 [2] _test_ambiguities(packages::Vector{Base.PkgId}; color::Nothing, exclude::Vector{Any}, detect_ambiguities_options::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Aqua ~/.julia/packages/Aqua/HWLbM/src/ambiguities.jl:117
Test Summary:    | Fail  Total  Time
Method ambiguity |    1      1  1.7s
ERROR: Some tests did not pass: 0 passed, 1 failed, 0 errored, 0 broken.
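
For illustration, here is a hedged sketch of how one of these (ambiguity #4) could be resolved along the lines Aqua suggests; it is not necessarily the fix the package adopted. The more specific method materializes the sparse source and reuses the generic AbstractDiskArray <- AbstractArray path:

using DiskArrays, SparseArrays

function Base.copyto!(dest::DiskArrays.AbstractDiskArray{<:Any,2},
                      src::SparseArrays.AbstractSparseMatrixCSC)
    # Matrix(src) is dense, so this dispatches to the existing generic disk-array method
    copyto!(dest, Matrix(src))
end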

eachchunk fall-back for transposed or offset arrays?

This is a low-priority issue for me, but when DiskArrays is used for in-memory arrays, it might be useful to adapt the default chunk size for transposed arrays:

julia> DiskArrays.default_chunk_size[] = 1
1

julia> eachchunk(ones(4000,4000))
1×130 DiskArrays.GridChunks{2, Tuple{DiskArrays.RegularChunks, DiskArrays.RegularChunks}}:
 (1:4000, 1:31)  (1:4000, 32:62)  (1:4000, 63:93)  …  (1:4000, 4000:4000)

julia> eachchunk(ones(4000,4000)')
1×130 DiskArrays.GridChunks{2, Tuple{DiskArrays.RegularChunks, DiskArrays.RegularChunks}}:
 (1:4000, 1:31)  (1:4000, 32:62)  (1:4000, 63:93)  …  (1:4000, 4000:4000)

For a transposed array I would expect the first chunk to be (1:31, 1:4000). I guess one can use strides to check this; a sketch of that idea follows below.
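
A sketch of that stride-based heuristic (the helper name is illustrative, not DiskArrays API): chunk along the dimension with the largest stride, so that a transposed in-memory array is chunked along its first axis.

function suggested_chunksize(A::AbstractMatrix, fast_len::Int)
    slowdim = argmax(collect(strides(A)))   # the dimension that is slow in memory
    ntuple(d -> d == slowdim ? fast_len : size(A, d), ndims(A))
end

suggested_chunksize(ones(4000, 4000), 31)    # (4000, 31), matching the current default
suggested_chunksize(ones(4000, 4000)', 31)   # (31, 4000), what one would expect here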

I am not sure if OffsetArrays are supposed to be supported, but currently all indices from eachchunk are one-based:

using DiskArrays, OffsetArrays
DiskArrays.default_chunk_size[] = 1
OA = OffsetArray(ones(4000,4000),(-1,-1));
c = eachchunk(OA);
OA[last(c)...]
# ERROR: BoundsError: attempt to access 4000×4000 OffsetArray(::Matrix{Float64}, 0:3999, 0:3999) with eltype Float64 with indices 0:3999×0:3999 at index [1:4000, 4000:4000]

None of these issues currently affect me in any production code, but I thought it might be useful to notify you about them.

I am using DiskArrays v0.3.22
