caterva's People

Contributors

aleixalcacer, dimitripapadopoulos, francescalted, martaiborra, mkitti, muusbolla, oscargm98

caterva's Issues

Contradictory license and copying files

The COPYING file is the GNU GPL.
The LICENSE file is BSD-like.
The documentation overall hints that the LICENSE file is the one that applies to further contributions.

I am not the only one confused: GitHub classifies the project as "Unknown, GPL-2.0 licenses found".

Can you please clarify?

Implement a resize functionality

This would allow extending or shrinking an array along different dimensions. I suggest a new function with a signature similar to this:

/**
 * @brief Resize a caterva array
 *
 * Changes the shape of the caterva array by growing or shrinking one or more dimensions.
 *
 * @param ctx The caterva context to be used.
 * @param array The caterva array.
 * @param new_dims New dimensions of the array.
 *
 * @return An error code
 */
int caterva_resize(caterva_ctx_t *ctx, caterva_array_t *array, int *new_dims);

Support 0 dimension size

It would be desirable to support arrays whose shape is (..., 0, ...). This would imply that the array has no data.

open() and close() are called way too many times

I have a 1GB dataset that contains a lot of identical (zero) bytes. It compresses down to ~640KB. The dimensions are:
Element size: 1 byte
Entire data: 1024x1024x1024
Chunk: 64x64x64
Block: 32x16x64

I have a simple test that just calls caterva_open() followed by caterva_to_buffer() to decompress the entire 1GB buffer. After hacking caterva_open() to take a blosc2_io* instead of forcing BLOSC2_IO_DEFAULTS, and adding a counter to see how often the i/o functions were called, I got the following output:

decompressed 1073741824B in 1524ms
open() called 40967 times
close() called 40967 times
tell() called 0 times
seek() called 40963 times
write() called times
read() called 45064 times
trunc() called 0 times

Per my math, it looks like we call open()+close() 2x per chunk (8192 total) and 1x per block (32768 total), even if those blocks are tiny (in this case each block is less than 20B compressed). This is still relatively fast on my system which has Linux's cached in-memory file i/o, but if the file functions were replaced with anything that had significant latency (remote file access), this could make things slow to a crawl. We also call malloc() in every open and free() in every close, so there are 40k of those as well.
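The estimate above can be reproduced with a short calculation. This is only a sketch of the arithmetic: the helper names count_parts and estimate_open_calls are hypothetical and not part of Caterva or Blosc2, and the "2 per chunk plus 1 per block" rule is taken from the observation in this report, not from the library source.

```c
#include <assert.h>

/* Number of partitions needed to cover a shape, using ceiling division
 * along every dimension. Hypothetical helper, not a Caterva function. */
static long count_parts(const long *shape, const long *part, int ndim) {
    long n = 1;
    for (int i = 0; i < ndim; i++) {
        assert(part[i] > 0);  /* a zero-sized partition would divide by zero */
        n *= (shape[i] + part[i] - 1) / part[i];
    }
    return n;
}

/* Estimated open() calls, assuming 2 calls per chunk and 1 per block
 * (the pattern reported in this issue). */
static long estimate_open_calls(const long *shape, const long *chunk,
                                const long *block, int ndim) {
    long nchunks = count_parts(shape, chunk, ndim);
    long blocks_per_chunk = count_parts(chunk, block, ndim);
    return 2 * nchunks + nchunks * blocks_per_chunk;
}
```

For the dataset described here (1024x1024x1024 data, 64x64x64 chunks, 32x16x64 blocks) this gives 4096 chunks with 8 blocks each, so 8192 + 32768 = 40960 estimated calls, close to the 40967 observed.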

Consider opening the file and then leaving it open for the duration of the decompression call. If we need to access sparse regions of the file, stream/buffer them in order (sequential disk byte order).

Support to add partitions to an empty schunk

It would be desirable to be able to fill an empty matrix (superchunk) with a caterva method. For that, the following three methods would be necessary:

  • caterva_has_next(): It indicates if the array is filled or if some partition is missing.
  • caterva_next(): It indicates the shape of the next block and the position of the first element in the array.
  • caterva_append(): A block buffer will be passed with the shape given by the above function.

These methods would be used as follows:

while(caterva_has_next()) {
    caterva_next()
    ...
    caterva_append()
}
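One way the loop above could track its position is with a small cursor over the partition grid. The sketch below is an assumption about how this might work, not the proposed Caterva API: sketch_cursor_t and the sketch_* functions are hypothetical names, partitions are visited in C order (last dimension fastest), and edge partitions are clipped to the array shape.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_DIM 8

/* Hypothetical cursor over the partitions of an array (not a Caterva type). */
typedef struct {
    int ndim;
    int64_t shape[SKETCH_MAX_DIM];   /* shape of the whole array */
    int64_t pshape[SKETCH_MAX_DIM];  /* shape of a full partition */
    int64_t next_index;              /* linear index of the next partition */
    int64_t nparts;                  /* total number of partitions */
} sketch_cursor_t;

static void sketch_cursor_init(sketch_cursor_t *c, int ndim,
                               const int64_t *shape, const int64_t *pshape) {
    assert(ndim > 0 && ndim <= SKETCH_MAX_DIM);
    c->ndim = ndim;
    c->next_index = 0;
    c->nparts = 1;
    for (int i = 0; i < ndim; i++) {
        assert(pshape[i] > 0);
        c->shape[i] = shape[i];
        c->pshape[i] = pshape[i];
        c->nparts *= (shape[i] + pshape[i] - 1) / pshape[i];
    }
}

/* True while some partition is still missing. */
static bool sketch_has_next(const sketch_cursor_t *c) {
    return c->next_index < c->nparts;
}

/* Reports the first element (start) and the shape of the next partition. */
static void sketch_next(const sketch_cursor_t *c, int64_t *start,
                        int64_t *next_shape) {
    int64_t rem = c->next_index;
    for (int i = c->ndim - 1; i >= 0; i--) {
        int64_t n = (c->shape[i] + c->pshape[i] - 1) / c->pshape[i];
        start[i] = (rem % n) * c->pshape[i];
        rem /= n;
        int64_t left = c->shape[i] - start[i];
        next_shape[i] = left < c->pshape[i] ? left : c->pshape[i];
    }
}

/* Stand-in for the append step: a real implementation would compress and
 * store the buffer; here only the cursor advances. */
static void sketch_append(sketch_cursor_t *c) {
    c->next_index++;
}
```

With a 5x4 array and 2x3 partitions, the cursor visits six partitions, the last of which starts at (4, 3) and has shape (1, 1).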

Generalize get slice

For now, if an array is based on a blosc schunk, the slice must also be an array based on a schunk.

This can be generalized so that the slice does not depend on how the source array was created.

Separate buffer/blosc storage in multiple files

Project repo:

include/
    caterva.h
caterva/
    caterva_private.h
    caterva.c
    caterva_blosc.c
    caterva_plainbuffer.c

Example of usage:

caterva_get_slice() {
    if (c->storage == PLAINBUFFER) {
        caterva_plainbuffer_get_slice()
    } else {
        caterva_blosc_get_slice()
    }
}

Add a filled parameter

It would be desirable to add a boolean parameter filled to show if the container is filled or not.

Save arrays on disk

Implement a save function:

caterva_save(caterva_ctx_t *ctx, char* urlpath, caterva_array_t *array);

Support the concept of multidimensional block

It would be useful to have another layer of chunking when defining the partitions. Blosc2 splits frames into chunks and blocks; right now, only the chunks can be multidimensional in Caterva. The idea would be to add this multidimensional capability to the blocks too, and to let the slice selection machinery use it to reduce the amount of data to be read.

problems in tests

When running the tests, the following tests fail:

  • caterva_test_copy
  • caterva_test_persistency
  • caterva_test_serialize

Some of the failures are due to the fact that we check whether the function remove_urlpath failed, which is the case when the file/directory does not exist yet.

Possible leak at caterva_blosc_array_empty()

When I run caterva_blosc_append(), valgrind warns me about a leak in blosc2_new_frame(). It seems that the created frame is not freed:

Leak_DefinitelyLost
frame.c
232 (48 direct, 184 indirect) bytes in 1 blocks are definitely lost in loss record 2 of 2
malloc
blosc2_new_frame
caterva_blosc_array_empty
caterva_array_empty

/* Create a new (empty) frame */
blosc2_frame*
blosc2_new_frame(const char* fname) {
  blosc2_frame* new_frame = malloc(sizeof(blosc2_frame));  // PROBLEM LINE
  memset(new_frame, 0, sizeof(blosc2_frame));
  if (fname != NULL) {
    char* new_fname = malloc(strlen(fname) + 1);  // + 1 for the trailing NULL
    new_frame->fname = strcpy(new_fname, fname);
  }

caterva_resize API

Up to now, in Caterva the first parameter of every function is a caterva_ctx_t * (it contains malloc/free functions along with other general parameters).

I think we should consider adding it to caterva_resize to unify the design.

What is your point of view? @FrancescAlted @martaiborra

NaN constructor

Implement a NaN constructor (only for 4/8 itemsize) using Blosc2 special_values machinery.

Copy method error with filled param

There is an error in the copy method involving the filled param. When an array is copied, the filled value is left undefined, so it can sometimes be false even though the array is filled.

Update free methods API

It would be desirable to change the free methods API:

  • caterva_free_array(caterva_array_t *carr) -> caterva_free_array(caterva_array_t **carr).
  • caterva_free_ctx(caterva_ctx_t *ctx) -> caterva_free_ctx(caterva_ctx_t **ctx).
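The usual motivation for a pointer-to-pointer parameter in a free function is that the callee can reset the caller's handle to NULL, which makes an accidental second free harmless. A minimal sketch of that idiom follows; sketch_array_t and its functions are hypothetical stand-ins, not the Caterva types, and the return convention (0 on success, -1 on a NULL handle) is an assumption.

```c
#include <stdlib.h>

/* Hypothetical stand-in for an array object (not caterva_array_t). */
typedef struct {
    double *data;
    size_t nitems;
} sketch_array_t;

static sketch_array_t *sketch_array_new(size_t nitems) {
    sketch_array_t *arr = malloc(sizeof *arr);
    if (arr == NULL) {
        return NULL;
    }
    arr->data = calloc(nitems, sizeof *arr->data);
    if (arr->data == NULL && nitems > 0) {
        free(arr);
        return NULL;
    }
    arr->nitems = nitems;
    return arr;
}

/* Frees *arr and sets the caller's pointer to NULL, so a repeated call is
 * a no-op instead of a double free. */
static int sketch_array_free(sketch_array_t **arr) {
    if (arr == NULL) {
        return -1;
    }
    if (*arr != NULL) {
        free((*arr)->data);
        free(*arr);
        *arr = NULL;
    }
    return 0;
}
```

A caller would write sketch_array_free(&a); afterwards a is NULL and calling the function again does nothing.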

Implement an Array object

This should be a subclass of Container, with a numpy dtype stored in a metalayer, and an additional __array__ method for better compatibility with the ecosystem.

Its methods should overload the existing ones in Container, but using the new self.dtype by default.

Syntax affair

issue

To avoid this kind of sentence, we could consider changing the word "size" to "length" when referring to the number of elements of the shape/chunkshape/blockshape/...

sequencial → sequential

This typo needs to be fixed across all the sources.

Unfortunately, the change will break the API back-compatibility, but I feel it's early enough in the project to break it.

Support resize array whose shape has a dim = 0

For the moment, resizing an array in which one of the original dimensions is 0 (e.g. with original shape {3, 0, 5}) won't be supported, to avoid dividing by 0 when updating the shapes. Zarr has the same behavior (see the example below).
In the future, we should somehow support it.

import zarr

z = zarr.zeros(shape=(10000, 0), chunks=(1000, 0))
z.resize(20000, 0)  # ZeroDivisionError
z.resize(20000, 10) # ZeroDivisionError
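A division by zero of this kind typically appears when the number of chunks along a dimension is computed as ceil(extent / chunk_extent) and the chunk extent is 0. The sketch below shows one possible guard in C; sketch_nchunks is a hypothetical helper, and both the place where Caterva performs the division and the -1 error convention are assumptions.

```c
#include <stdint.h>

/* Hypothetical helper: number of chunks needed along one dimension.
 * Returns 0 for an empty dimension, and -1 when the dimension is not empty
 * but the chunk extent is not positive (the case that would otherwise
 * divide by zero). */
static int64_t sketch_nchunks(int64_t extent, int64_t chunk_extent) {
    if (extent == 0) {
        return 0;   /* nothing to store, whatever the chunk extent is */
    }
    if (chunk_extent <= 0) {
        return -1;  /* caller must choose a valid chunk extent first */
    }
    return (extent + chunk_extent - 1) / chunk_extent;  /* ceiling division */
}
```

With such a guard, growing a zero-sized dimension would still require picking a non-zero chunk extent for it before the shapes can be updated.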

Caterva frees

In caterva.h there are two free functions whose parameters I am not sure have the correct types:
int caterva_ctx_free(caterva_ctx_t **ctx);
int caterva_free(caterva_ctx_t *ctx, caterva_array_t **array);

Is it right that the structures are passed as ** instead of *?

Update caterva_fill

Update caterva_fill using the method caterva_append. Another option is to remove this method.
