francescalted / caterva
A multidimensional data container on top of Blosc2.
Home Page: https://caterva.readthedocs.io
License: Other
test_squeeze.c and test_squeeze_index.c fail when we run them with 2 threads and 1-dim shapes. The likely cause is that the shapes are too small.
The COPYING file is GNU GPL, while the LICENSE file is BSD-like. The documentation overall hints that the LICENSE file is the one that applies to further contributions. I am not the only one confused, as GitHub classifies the project as "Unknown, GPL-2.0 licenses found". Can you please clarify?
Different errors have appeared in Blosc function calls, so return codes should be checked more systematically.
This would allow extending or shrinking an array along different dimensions. I suggest a new function with a signature similar to this:
/**
* @brief Resize a caterva array
*
* Changes the shape of the caterva array by growing or shrinking one or more dimensions.
*
* @param ctx The caterva context to be used.
* @param array The caterva array.
* @param new_dims New dimensions of the array.
*
* @return An error code
*/
int caterva_resize(caterva_ctx_t *ctx, caterva_array_t *array, int *new_dims);
Rename the filled param of a caterva container to full.
It would be desirable to support arrays whose shape is (..., 0, ...). This would imply that the array has no data.
I have a 1GB dataset that contains a lot of identical (zero) bytes. It compresses down to ~640KB. The dimensions are:
Element size: 1 byte
Entire data: 1024x1024x1024
Chunk: 64x64x64
Block: 32x16x64
I have a simple test that just calls caterva_open() followed by caterva_to_buffer() to decompress the entire 1GB buffer. After hacking caterva_open() to take a blosc2_io* instead of forcing BLOSC2_IO_DEFAULTS, and adding a counter to see how often the i/o functions were called, I saw the following output:
decompressed 1073741824B in 1524ms
open() called 40967 times
close() called 40967 times
tell() called 0 times
seek() called 40963 times
write() called times
read() called 45064 times
trunc() called 0 times
Per my math, it looks like we call open()+close() 2x per chunk (8192 total) and 1x per block (32768 total), even if those blocks are tiny (in this case each block is less than 20B compressed). This is still relatively fast on my system which has Linux's cached in-memory file i/o, but if the file functions were replaced with anything that had significant latency (remote file access), this could make things slow to a crawl. We also call malloc() in every open and free() in every close, so there are 40k of those as well.
Consider opening the file and then leaving it open for the duration of the decompression call. If we need to access sparse regions of the file, stream/buffer them in order (sequential disk byte order).
It would be desirable to be able to fill an empty matrix (super-chunk) with a caterva method. For that, the following three methods would be necessary:
caterva_has_next(): indicates whether the array is filled or some partition is still missing.
caterva_next(): indicates the shape of the next block and the position of its first element in the array.
caterva_append(): a block buffer will be passed with the shape given by the above function.
These methods would be used like this:
while(caterva_has_next()) {
caterva_next()
...
caterva_append()
}
For now, if an array is backed by a Blosc schunk, the slice must also be an array backed by a schunk. This could be generalized so that it does not depend on how the origin matrix was created.
Project repo:
include/
caterva.h
caterva/
caterva_private.h
caterva.c
caterva_blosc.c
caterva_plainbuffer.c
Example of usage:
caterva_get_slice() {
    if (c->storage == PLAINBUFFER) {
        caterva_plainbuffer_get_slice();
    } else {
        caterva_blosc_get_slice();
    }
}
Add to the development section in docs:
Help for newcomers
Bug reports
Contributing code
Code style
...
It would be desirable to add a boolean parameter filled to show whether the container is filled or not.
I have been putting some time to package Caterva for conda-forge. I based the package on the 0.3.3 version released a few weeks ago, and for the conda-forge packaging I needed the changes stated here: https://github.com/Blosc/staged-recipes/blob/caterva/recipes/caterva/caterva.patch
Again, these were made against 0.3.3 and need to be adapted to what is in master now.
Implement a save function:
caterva_save(caterva_ctx_t *ctx, char* urlpath, caterva_array_t *array);
It would be useful to have another layer of chunking when defining the partitions. Blosc2 splits frames into chunks and blocks; right now, only the chunks can be multidimensional in Caterva. The idea would be to add this multidimensionality to the blocks too, and also enable the slice-selection machinery to use it to reduce the amount of data to be read.
When running the tests, the following tests fail:
Some failures are due to the fact that we check whether the function remove_urlpath failed, which is true if the file/directory does not exist yet.
When I run caterva_blosc_append(), valgrind warns me about a leak in blosc2_new_frame(). It seems that the created frame is not freed:
Leak_DefinitelyLost
frame.c
232 (48 direct, 184 indirect) bytes in 1 blocks are definitely lost in loss record 2 of 2
malloc
blosc2_new_frame
caterva_blosc_array_empty
caterva_array_empty
/* Create a new (empty) frame */
blosc2_frame* blosc2_new_frame(const char* fname) {
    blosc2_frame* new_frame = malloc(sizeof(blosc2_frame));  // PROBLEM LINE
    memset(new_frame, 0, sizeof(blosc2_frame));
    if (fname != NULL) {
        char* new_fname = malloc(strlen(fname) + 1);  // + 1 for the trailing NUL
        new_frame->fname = strcpy(new_fname, fname);
    }
Up to now, in Caterva the first parameter of every function is a caterva_ctx_t * (it contains malloc/free functions along with other general parameters). I think we should consider adding it to caterva_resize to unify the design.
What is your point of view? @FrancescAlted @martaiborra
The get_slice does not copy the metalayers and the vlmetalayers.
Implement a NaN constructor (only for 4/8 itemsize) using Blosc2 special_values machinery.
There is an error in the copy method with the filled param. When an array is copied, the filled value is not defined, so it can sometimes be false even when the array is filled.
It would be desirable to change the free methods API:
caterva_free_array(caterva_array_t *carr) -> caterva_free_array(caterva_array_t **carr)
caterva_free_ctx(caterva_ctx_t *ctx) -> caterva_free_ctx(caterva_ctx_t **ctx)
This should be a subclass of Container, with a numpy dtype stored in a metalayer, and an additional __array__ method for better compatibility with the ecosystem. Its methods should overload the existing ones in Container, but using the new self.dtype by default.
This typo needs to be fixed across all the sources.
Unfortunately, the change will break the API back-compatibility, but I feel it's early enough in the project to break it.
For the moment, resizing an array with an original dimension of 0 (e.g. with original shape {3, 0, 5}) won't be supported, to avoid dividing by 0 when updating the shapes. Zarr has the same behavior (see example below). In the future, we should somehow support it.
import zarr
z = zarr.zeros(shape=(10000, 0), chunks=(1000, 0))
z.resize(20000, 0) # ZeroDivisionError
z.resize(20000, 10) # ZeroDivisionError
In caterva.h there are two free functions whose parameters I am not sure have the correct types:
int caterva_ctx_free(caterva_ctx_t **ctx);
int caterva_free(caterva_ctx_t *ctx, caterva_array_t **array);
Is it right that the structures are ** instead of *?
Title says it all.
Update caterva_fill using the method caterva_append. Another option is to remove this method.
cpplint is a code checker for C/C++. It could be used in the CI to check the code style. What do you think @FrancescAlted?
Currently, an error message is sent to stderr whenever there is a problem. It would be better to emit those errors only when an environment variable (e.g. CATERVA_TRACE or CATERVA_PRINT_ERRORS) is set.
Use the concept of strides to simplify the code
Make it possible to create dynamic and static libraries from the top-level CMakeLists.txt.