francescalted / caterva
A multidimensional data container on top of Blosc2.
Home Page: https://caterva.readthedocs.io
License: Other
test_squeeze.c and test_squeeze_index.c fail when we run them with 2 threads and 1-dim shapes. The likely cause is that the shapes are too small.
The COPYING file is GNU GPL, while the LICENSE file is BSD-like. The documentation overall hints that the LICENSE file is the one that applies to further contributions. I am not the only one confused, as GitHub classifies the project as "Unknown, GPL-2.0 licenses found". Can you please clarify?
Different errors have appeared in Blosc function calls, so return codes should be checked more systematically.
This would allow extending or shrinking an array along different dimensions. I suggest a new function with a signature similar to this:
/**
* @brief Resize a caterva array
*
* Changes the shape of the caterva array by growing or shrinking one or more dimensions.
*
* @param ctx The caterva context to be used.
* @param array The caterva array.
* @param new_dims New dimensions of the array.
*
* @return An error code
*/
int caterva_resize(caterva_ctx_t *ctx, caterva_array_t *array, int *new_dims);
Rename the filled param of a caterva container to full.
It would be desirable to support arrays whose shape is (..., 0, ...). This would imply that the array has no data.
I have a 1GB dataset that contains a lot of identical (zero) bytes. It compresses down to ~640KB. The dimensions are:
Element size: 1 byte
Entire data: 1024x1024x1024
Chunk: 64x64x64
Block: 32x16x64
I have a simple test that just calls caterva_open() followed by caterva_to_buffer() to decompress the entire 1GB buffer. After hacking caterva_open() to take a blosc2_io* instead of forcing BLOSC2_IO_DEFAULTS, and adding a counter to see how often the i/o functions were called, I saw the following output:
decompressed 1073741824B in 1524ms
open() called 40967 times
close() called 40967 times
tell() called 0 times
seek() called 40963 times
write() called times
read() called 45064 times
trunc() called 0 times
Per my math, it looks like we call open()+close() 2x per chunk (8192 total) and 1x per block (32768 total), even if those blocks are tiny (in this case each block is less than 20B compressed). This is still relatively fast on my system which has Linux's cached in-memory file i/o, but if the file functions were replaced with anything that had significant latency (remote file access), this could make things slow to a crawl. We also call malloc() in every open and free() in every close, so there are 40k of those as well.
Consider opening the file and then leaving it open for the duration of the decompression call. If we need to access sparse regions of the file, stream/buffer them in order (sequential disk byte order).
It would be desirable to be able to fill an empty matrix (super-chunk) with a caterva method. For that, the following three methods would be necessary:
caterva_has_next(): indicates whether the array is filled or some partition is still missing.
caterva_next(): indicates the shape of the next block and the position of its first element in the array.
caterva_append(): a block buffer will be passed with the shape given by the above function.
These methods would be used like this:
while(caterva_has_next()) {
caterva_next()
...
caterva_append()
}
For now, if an array is backed by a Blosc schunk, the slice must also be an array backed by a schunk. This could be generalized so that it does not depend on how the origin matrix was created.
Project repo:
include/
caterva.h
caterva/
caterva_private.h
caterva.c
caterva_blosc.c
caterva_plainbuffer.c
Example of usage:
caterva_get_slice() {
    if (c->storage == PLAINBUFFER) {
        caterva_plainbuffer_get_slice();
    } else {
        caterva_blosc_get_slice();
    }
}
Add to the development section in docs:
Help for newcomers
Bug reports
Contributing code
Code style
...
It would be desirable to add a boolean parameter filled to show whether the container is filled or not.
I have been putting some time to package Caterva for conda-forge. I based the package on the 0.3.3 version released a few weeks ago, and for the conda-forge packaging I needed the changes stated here: https://github.com/Blosc/staged-recipes/blob/caterva/recipes/caterva/caterva.patch
Again, these were made against 0.3.3 and need to be adapted to what is in master now.
Implement a save function:
caterva_save(caterva_ctx_t *ctx, char* urlpath, caterva_array_t *array);
It would be useful to have another layer of chunking when defining the partitions. Blosc2 splits frames into chunks and blocks; right now, only the chunks can be multidimensional in Caterva. The idea would be to add this multidimensionality to the blocks too, and also enable the slice-selection machinery to use it to reduce the amount of data to be read.
When running the tests, the following tests fail:
Some failures are due to the fact that we check whether the function remove_urlpath failed, which is true if the file/directory does not exist yet.
When I run caterva_blosc_append(), valgrind warns me about a leak in blosc2_new_frame(). It seems that the created frame is not freed:
Leak_DefinitelyLost
frame.c
232 (48 direct, 184 indirect) bytes in 1 blocks are definitely lost in loss record 2 of 2
malloc
blosc2_new_frame
caterva_blosc_array_empty
caterva_array_empty
/* Create a new (empty) frame */
blosc2_frame* blosc2_new_frame(const char* fname) {
    blosc2_frame* new_frame = malloc(sizeof(blosc2_frame));  // PROBLEM LINE
    memset(new_frame, 0, sizeof(blosc2_frame));
    if (fname != NULL) {
        char* new_fname = malloc(strlen(fname) + 1);  // + 1 for the trailing NUL
        new_frame->fname = strcpy(new_fname, fname);
    }
Up to now, in Caterva the first parameter of every function is a caterva_ctx_t * (it contains malloc/free functions along with other general parameters). I think we should consider adding it to caterva_resize to unify the design.
What is your point of view? @FrancescAlted @martaiborra
The get_slice does not copy the metalayers and the vlmetalayers.
Implement a NaN constructor (only for 4/8 itemsize) using Blosc2 special_values machinery.
There is an error in the copy method with the filled param. When an array is copied, the filled value is not defined, so it can sometimes be false even when the array is filled.
It would be desirable to change the free methods API:
caterva_free_array(caterva_array_t *carr) -> caterva_free_array(caterva_array_t **carr)
caterva_free_ctx(caterva_ctx_t *ctx) -> caterva_free_ctx(caterva_ctx_t **ctx)
This should be a subclass of Container, with a numpy dtype stored in a metalayer, and an additional __array__ method for better compatibility with the ecosystem. Its methods should overload the existing ones in Container, but using the new self.dtype by default.
This typo needs to be fixed across all the sources.
Unfortunately, the change will break the API back-compatibility, but I feel it's early enough in the project to break it.
For the moment, resizing an array with an original dimension of 0 (e.g. with original shape {3, 0, 5}) won't be supported, to avoid dividing by 0 when updating the shapes. Zarr has the same behavior (see example below). In the future, we should somehow support it.
import zarr
z = zarr.zeros(shape=(10000, 0), chunks=(1000, 0))
z.resize(20000, 0) # ZeroDivisionError
z.resize(20000, 10) # ZeroDivisionError
In caterva.h there are two free functions whose parameters I am not sure have the correct types:
int caterva_ctx_free(caterva_ctx_t **ctx);
int caterva_free(caterva_ctx_t *ctx, caterva_array_t **array);
Is it right that the structures are ** instead of *?
Title says it all.
Update caterva_fill using the method caterva_append. Another option is to remove this method.
cpplint is a code checker for C/C++. It could be used in the CI to check the code style. What do you think @FrancescAlted?
Currently, an error message is sent to stderr whenever there is a problem. It would be better to emit those errors only when an environment variable (e.g. CATERVA_TRACE or CATERVA_PRINT_ERRORS) is set.
Use the concept of strides to simplify the code
Make it possible to create dynamic and static libraries from the top-level CMakeLists.txt.