
tiledb-py's Introduction

TileDB-Py

TileDB-Py is a Python interface to the TileDB Storage Engine.

Quick Installation

TileDB-Py is available from PyPI with pip:

pip install tiledb

or from conda-forge with conda or mamba:

conda install -c conda-forge tiledb-py

DataFrame functionality (tiledb.from_pandas, Array.df[]) requires Pandas 1.0 or higher and PyArrow 1.0 or higher.

Contributing

We welcome contributions. Please see CONTRIBUTING.md for suggestions and development-build instructions. For larger features, please open an issue to discuss goals and approach first, to ensure a smooth PR integration and review process.

tiledb-py's Issues

Segfault on creating a Ctx

Installation using conda with Python 3.6.2 on Ubuntu 16.04.

tiledb                    1.2.2                h650255c_2    conda-forge
tiledb-py                 0.1.1                    py36_0    conda-forge

Trying to create a Ctx causes a segfault:

In [1]: import tiledb

In [2]: ctx = tiledb.Ctx()
*** Error in `/home/nezar/miniconda3/envs/py36/bin/python': free(): invalid pointer: 0x00007f1092c5bec0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f109ef5c7e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f109ef6537a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f109ef6953c]
/home/nezar/miniconda3/envs/py36/lib/python3.6/site-packages/tiledb/../../../libtiledb.so(+0x86e39)[0x7f10939cae39]
/home/nezar/miniconda3/envs/py36/lib/python3.6/site-packages/tiledb/../../../libtiledb.so(+0x8cdcc)[0x7f10939d0dcc]
/home/nezar/miniconda3/envs/py36/lib/python3.6/site-packages/tiledb/../../../libtiledb.so(+0x9f5f8)[0x7f10939e35f8]
/home/nezar/miniconda3/envs/py36/lib/python3.6/site-packages/tiledb/../../../libtiledb.so(+0x173cc1)[0x7f1093ab7cc1]
/home/nezar/miniconda3/envs/py36/lib/python3.6/site-packages/tiledb/../../../libtiledb.so(tiledb_ctx_create+0xd1)[0x7f109399d011]
/home/nezar/miniconda3/envs/py36/lib/python3.6/site-packages/tiledb/libtiledb.cpython-36m-x86_64-linux-gnu.so(+0x89ec4)[0x7f1093db6ec4]
/home/nezar/miniconda3/envs/py36/bin/../lib/libpython3.6m.so.1.0(+0xda2ac)[0x7f109febe2ac]
/home/nezar/miniconda3/envs/py36/bin/../lib/libpython3.6m.so.1.0(_PyObject_FastCallDict+0x9e)[0x7f109fe4fc1e]

...

Read failed; Trying to read beyond buffer size

  1. I created a TileDB KV array and stored around 10000 key-value pairs (the tiledb directory grew to about 18MB).
  2. I consolidated the KV array after the writes.
  3. I started reading the KV array using those 10000 keys (just to get a benchmark on read performance), but it threw an error:
File "key_value.py", line 55, in store_vectors
    r = self.key_value_instance[v["key"]]
  File "tiledb/libtiledb.pyx", line 1567, in tiledb.libtiledb.KV.__getitem__
  File "tiledb/libtiledb.pyx", line 120, in tiledb.libtiledb.check_error
  File "tiledb/libtiledb.pyx", line 116, in tiledb.libtiledb._raise_ctx_err
  File "tiledb/libtiledb.pyx", line 101, in tiledb.libtiledb._raise_tiledb_error
tiledb.libtiledb.TileDBError: [TileDB::Buffer] Error: Read failed; Trying to read beyond buffer size

I don't get this error if I skip step 2 (consolidation of the updates), but then my read performance drops to around 1.5 seconds per key read.

Allow optional schema argument(s) for `from_numpy` method

This would be convenient for creating a TileDB array from a Numpy array while still allowing for control over tile extents, compressors, etc. We could either allow passing a schema object, or instead individual arguments to control aspects like the tile extents.

Cannot read from float sparse array

To reproduce:

  1. Download nefazodone.raw.

  2. Run script:

import sys
import tiledb
import numpy as np

ctx = tiledb.Ctx()

if '--init' in sys.argv:
    tiledb.remove(ctx, 'spec')

    dom = tiledb.Domain(ctx,
                        tiledb.Dim(ctx, name='scan', domain=(1, 506), tile=20, dtype=float),
                        tiledb.Dim(ctx, name='mz', domain=(0, 2000), tile=10, dtype=float))
    schema = tiledb.ArraySchema(ctx, domain=dom, sparse=True, capacity=1024,
                                attrs=[tiledb.Attr(ctx, name='intensity', dtype=float, compressor=('lz4', 0))],
                                coords_compressor=('lz4', 0))
    spec = tiledb.SparseArray.create('spec', schema)

    with tiledb.SparseArray(ctx, 'spec', mode='w') as spec:
        npoints = 27_464_448
        scan_arr = np.zeros(npoints)
        mz_arr = np.zeros(npoints)
        intens_arr = np.zeros(npoints)

        tiledb.stats_enable()

        i = 0
        scan = 0
        for line in open('nefazodone.raw'):
            line = line.strip()
            if line.startswith('RetTime='):
                scan += 1
            elif line.startswith('Mz='):
                mz, intens = line.split('=')[1].split(' ')
                scan_arr[i] = scan
                mz_arr[i] = float(mz)
                intens_arr[i] = float(intens)
                i += 1

        spec[scan_arr, mz_arr] = {'intensity': intens_arr}
        assert i == npoints
        tiledb.stats_dump()
        tiledb.stats_disable()

with tiledb.SparseArray(ctx, 'spec', mode='r') as spec:
    print(spec.nonempty_domain())
    print(spec.domain)
    tiledb.stats_enable()
    data = spec[9.5:10.5, :]
    tiledb.stats_dump()
    tiledb.stats_disable()
    print(data['intensity'])

    tiledb.stats_enable()
    data = spec[468.5:469.5, :]
    tiledb.stats_dump()
    tiledb.stats_disable()
    print(data['intensity'])
> python3 test.py --init

It outputs

===================================== TileDB Statistics Report =======================================

Individual function statistics:
  Function name                                                          # calls       Total time (ns)
  ----------------------------------------------------------------------------------------------------
  compressor_blosc_compress,                                                   0,                   0
  compressor_blosc_decompress,                                                 0,                   0
  compressor_bzip_compress,                                                    0,                   0
  compressor_bzip_decompress,                                                  0,                   0
  compressor_dd_compress,                                                      0,                   0
  compressor_dd_decompress,                                                    0,                   0
  compressor_gzip_compress,                                                   40,            64836000
  compressor_gzip_decompress,                                                  0,                   0
  compressor_lz4_compress,                                                 80463,           914778000
  compressor_lz4_decompress,                                                   0,                   0
  compressor_rle_compress,                                                     0,                   0
  compressor_rle_decompress,                                                   0,                   0
  compressor_zstd_compress,                                                    0,                   0
  compressor_zstd_decompress,                                                  0,                   0
  cache_lru_evict,                                                             0,                   0
  cache_lru_insert,                                                            0,                   0
  cache_lru_read,                                                              0,                   0
  cache_lru_read_partial,                                                      0,                   0
  reader_compute_cell_ranges,                                                  0,                   0
  reader_compute_dense_cell_ranges,                                            0,                   0
  reader_compute_dense_overlapping_tiles_and_cell_ranges,                      0,                   0
  reader_compute_overlapping_coords,                                           0,                   0
  reader_compute_overlapping_tiles,                                            0,                   0
  reader_compute_tile_coordinates,                                             0,                   0
  reader_copy_fixed_cells,                                                     0,                   0
  reader_copy_var_cells,                                                       0,                   0
  reader_dedup_coords,                                                         0,                   0
  reader_dense_read,                                                           0,                   0
  reader_fill_coords,                                                          0,                   0
  reader_init_tile_fragment_dense_cell_range_iters,                            0,                   0
  reader_next_subarray_partition,                                              0,                   0
  reader_read,                                                                 0,                   0
  reader_read_all_tiles,                                                       0,                   0
  reader_sort_coords,                                                          0,                   0
  reader_sparse_read,                                                          0,                   0
  writer_check_coord_dups,                                                     1,           169219000
  writer_check_coord_dups_global,                                              0,                   0
  writer_compute_coord_dups,                                                   0,                   0
  writer_compute_coord_dups_global,                                            0,                   0
  writer_compute_coords_metadata,                                              1,           175605000
  writer_compute_write_cell_ranges,                                            0,                   0
  writer_create_fragment,                                                      1,              569000
  writer_global_write,                                                         0,                   0
  writer_init_global_write_state,                                              0,                   0
  writer_init_tile_dense_cell_range_iters,                                     0,                   0
  writer_ordered_write,                                                        0,                   0
  writer_prepare_full_tiles_fixed,                                             0,                   0
  writer_prepare_full_tiles_var,                                               0,                   0
  writer_prepare_tiles_fixed,                                                  2,          1878252000
  writer_prepare_tiles_ordered,                                                0,                   0
  writer_prepare_tiles_var,                                                    0,                   0
  writer_sort_coords,                                                          1,           278904000
  writer_unordered_write,                                                      1,         13153812000
  writer_write,                                                                1,         13153813000
  writer_write_tiles,                                                          2,         21547843000
  sm_array_close,                                                              0,                   0
  sm_array_open,                                                               0,                   0
  sm_read_from_cache,                                                          0,                   0
  sm_write_to_cache,                                                           0,                   0
  sm_query_submit,                                                             1,         13153859000
  tileio_read,                                                                 0,                   0
  tileio_write,                                                            53642,         20869841000
  tileio_compress_tile,                                                    53643,          1561913000
  tileio_compress_one_tile,                                                80464,          1223685000
  tileio_decompress_tile,                                                      0,                   0
  tileio_decompress_one_tile,                                                  0,                   0
  vfs_abs_path,                                                                4,               33000
  vfs_close_file,                                                              3,           629135000
  vfs_constructor,                                                             0,                   0
  vfs_create_bucket,                                                           0,                   0
  vfs_create_dir,                                                              1,              331000
  vfs_create_file,                                                             0,                   0
  vfs_destructor,                                                              0,                   0
  vfs_empty_bucket,                                                            0,                   0
  vfs_file_size,                                                               0,                   0
  vfs_filelock_lock,                                                           0,                   0
  vfs_filelock_unlock,                                                         0,                   0
  vfs_init,                                                                    0,                   0
  vfs_is_bucket,                                                               0,                   0
  vfs_is_dir,                                                                  2,              231000
  vfs_is_empty_bucket,                                                         0,                   0
  vfs_is_file,                                                                 0,                   0
  vfs_ls,                                                                      0,                   0
  vfs_move_file,                                                               0,                   0
  vfs_move_dir,                                                                0,                   0
  vfs_open_file,                                                               0,                   0
  vfs_read,                                                                    0,                   0
  vfs_remove_bucket,                                                           0,                   0
  vfs_remove_file,                                                             0,                   0
  vfs_remove_dir,                                                              0,                   0
  vfs_supports_fs,                                                             0,                   0
  vfs_sync,                                                                    0,                   0
  vfs_write,                                                               53644,         19233315000
  vfs_s3_fill_file_buffer,                                                     0,                   0
  vfs_s3_write_multipart,                                                      0,                   0

Individual counter statistics:
  Counter name                                                             Value
  ------------------------------------------------------------------------------
  cache_lru_inserts,                                                           0
  cache_lru_read_hits,                                                         0
  cache_lru_read_misses,                                                       0
  reader_num_attr_tiles_touched,                                               0
  reader_num_fixed_cell_bytes_copied,                                          0
  reader_num_fixed_cell_bytes_read,                                            0
  reader_num_var_cell_bytes_copied,                                            0
  reader_num_var_cell_bytes_read,                                              0
  writer_num_attr_tiles_written,                                           53642
  sm_contexts_created,                                                         0
  sm_query_submit_layout_col_major,                                            0
  sm_query_submit_layout_row_major,                                            0
  sm_query_submit_layout_global_order,                                         0
  sm_query_submit_layout_unordered,                                            1
  sm_query_submit_read,                                                        0
  sm_query_submit_write,                                                       1
  tileio_read_cache_hits,                                                      0
  tileio_read_num_bytes_read,                                                  0
  tileio_read_num_resulting_bytes,                                             0
  tileio_write_num_bytes_written,                                      235219267
  tileio_write_num_input_bytes,                                        661721730
  vfs_read_total_bytes,                                                        0
  vfs_write_total_bytes,                                               235219267
  vfs_read_num_parallelized,                                                   0
  vfs_posix_write_num_parallelized,                                            0
  vfs_win32_write_num_parallelized,                                            0
  vfs_s3_num_parts_written,                                                    0
  vfs_s3_write_num_parallelized,                                               0

Summary:
--------
Hardware concurrency: 8
Reads:
  Read query submits: 0
  Tile cache hit ratio: 0 / 0 
  Fixed-length tile data copy-to-read ratio: 0 / 0 bytes
  Var-length tile data copy-to-read ratio: 0 / 0 bytes
  Total tile data copy-to-read ratio: 0 / 0 bytes
  Read compression ratio: 0 / 0 bytes
Writes:
  Write query submits: 1
  Tiles written: 53642
  Write compression ratio: 661721730 / 235219267 bytes (2.8x)

It then hangs for a long time and finally segfaults.

Directory "spec" is attached.

  1. If the capacity=1024 argument is removed, the first query data = spec[9.5:10.5, :] succeeds, but the second query data = spec[468.5:469.5, :] still fails.

Handle negative indexing

This can all be handled on the Python side: check for negative indices, reverse the index-domain subarray logic for start:stop, and apply the step at the end of the query.
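
The approach described above could look roughly like the sketch below. It is not TileDB-Py code: slice_to_subarray and read_slice are hypothetical names, the dimension is assumed to be a 1-D integer domain starting at 0, and a plain Python list stands in for the dense array read.

```python
def slice_to_subarray(sl, size):
    """Translate a Python slice into an inclusive (lo, hi) range plus a step.

    Assumes a 1-D integer dimension whose domain is (0, size - 1).
    Returns None when the slice selects nothing, so no query is needed.
    """
    start, stop, step = sl.indices(size)  # resolves negative and missing bounds
    selected = range(start, stop, step)
    if len(selected) == 0:
        return None
    first, last = selected[0], selected[-1]
    # For a negative step the first selected index is the larger one,
    # so the subarray bounds are swapped into (low, high) order.
    return min(first, last), max(first, last), step


def read_slice(values, sl):
    """Stand-in for a dense read: fetch the inclusive range, then apply the step."""
    bounds = slice_to_subarray(sl, len(values))
    if bounds is None:
        return []
    lo, hi, step = bounds
    block = values[lo:hi + 1]  # plays the role of the subarray query
    return block[::step]       # a negative step also reverses the block
```

Because the subarray is bounded exactly by the first and last selected indices, applying the step to the fetched block is enough for both forward and reversed slices.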

Efficient 'stream'-processing

Hi, I'm currently evaluating TileDB for our project and I'm trying to find out how to efficiently read data. So let's say we create a DenseArray like this:

    d_xy = tiledb.Dim(ctx, "xy", domain=(1, 256 * 256), tile=32, dtype="uint64")
    d_u = tiledb.Dim(ctx, "u", domain=(1, 128), tile=128, dtype="uint64")
    d_w = tiledb.Dim(ctx, "w", domain=(1, 128), tile=128, dtype="uint64")
    domain = tiledb.Domain(ctx, d_xy, d_u, d_w)
    a1 = tiledb.Attr(ctx, "a1", compressor=None, dtype="float64")
    arr = tiledb.DenseArray(
        ctx,
        array_name,
        domain=domain,
        attrs=(a1,),
        cell_order='row-major',
        tile_order='row-major'
    )

When reading from the array using the slicing interface, in slices that fit the d_xy tiling, for example arr[1:33], I noticed that a lot of time is spent copying data (I can provide a flamegraph if you are interested). So I'm trying to understand what is happening behind the scenes: in the Domain I created, the cells have a shape of (32, 128, 128), right? And are they saved linearly to disk?

I found the read_direct method, which should not involve a copy, but as it reads the whole array it won't work for huge arrays, and it won't be cache efficient. We would like to process the data in nice chunks that fit into the L3 cache, so we thought working cell-by-cell would be optimal.

Maybe using an interface like this:

iter = arr.cell_iterator(attrs=['a1'], ...)
for cell in iter:
    assert type(cell['a1']) == np.ndarray
    # ...do some work on this cell...

This way, workloads that process the whole array can be implemented such that TileDB can make sure that the reading is done efficiently. If the processing is distributed on many machines, cell_iterator would need some way to specify what partition to return cells from.

(The cell could also have some information about its 'geometry', i.e. what parts of the array it consists of)

As an alternative, maybe the read_direct interface could be augmented to allow reading only parts of the array, and err out if you cross a cell boundary. That way, TileDB users can build something like the above interface themselves.

I'm just brain dumping, so let me know if this kind of feedback is useful!

cc @uellue

Subselection on attributes

We need to expose the ability to subselect on attributes before indexing. Example API: A.attrs["a1", "a2"][10:20]
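
One way the example API could be wired up is with a small proxy object, sketched below. ToyArray, AttrIndexer and AttrSelection are hypothetical names, and a dict of Python lists stands in for real attribute storage; this only illustrates the two-step indexing, not how TileDB-Py implements it.

```python
class AttrSelection:
    """Proxy that remembers the chosen attribute names until it is indexed."""

    def __init__(self, columns, names):
        self._columns = columns
        self._names = names

    def __getitem__(self, index):
        # Only the selected attributes are read for the requested range.
        return {name: self._columns[name][index] for name in self._names}


class AttrIndexer:
    """Object behind `array.attrs`; indexing it with names yields a selection."""

    def __init__(self, columns):
        self._columns = columns

    def __getitem__(self, names):
        if isinstance(names, str):
            names = (names,)
        unknown = [n for n in names if n not in self._columns]
        if unknown:
            raise KeyError("unknown attribute(s): {}".format(unknown))
        return AttrSelection(self._columns, tuple(names))


class ToyArray:
    """In-memory stand-in for an array with several attributes."""

    def __init__(self, columns):
        self.attrs = AttrIndexer(columns)
```

With this in place, `A.attrs["a1", "a2"][10:20]` returns a dict containing only the "a1" and "a2" values for that range.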

Conda Packaging

Make a conda recipe for building tiledb-py with libtiledb as a dependency, build and host binary images on Linux64, OSX, and Windows.

Unordered coordinate reads

Impacts both sparse and dense arrays:

If the requested coordinates are ordered, issue a single ordered coordinate read.
If they are unordered, fall back to multiple point queries and fill in the result buffer incrementally.

Order is assumed to be coordinate-sorted order (low to high); reverse order is not detected.
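
A minimal sketch of that dispatch logic follows. read_ordered and read_point are hypothetical callables standing in for the actual TileDB queries, and coordinates are taken to be a 1-D list for simplicity.

```python
def is_sorted(coords):
    """True when coords are in low-to-high order (ties allowed)."""
    return all(a <= b for a, b in zip(coords, coords[1:]))


def read_coords(coords, read_ordered, read_point):
    """Use one ordered read when possible, otherwise one point query per coordinate."""
    if is_sorted(coords):
        return read_ordered(coords)
    # Fallback: fill the result buffer incrementally, in the caller's order.
    # Reverse-sorted input is not special-cased and also takes this path.
    result = [None] * len(coords)
    for i, coord in enumerate(coords):
        result[i] = read_point(coord)
    return result
```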

Implement iterator over tiles

Currently, users can query arrays by coordinate ranges. However, certain workflows require processing the entire dataset/array, although not necessarily loading it fully into memory. Tiles represent natural data units for such processing; it would be extremely useful to iterate through all tiles of an array (as a sequence of numpy.array/OrderedDict objects, like subarray()).

This would also be valuable for inspection and debugging.
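
Until such an iterator exists, tile-aligned coordinate ranges can be computed on the user side and fed to ordinary range queries. The sketch below assumes integer dimensions described by inclusive (lo, hi) bounds and a tile extent per dimension; tile_ranges is a hypothetical helper, not part of TileDB-Py.

```python
from itertools import product


def tile_ranges(domain, extents):
    """Yield one tuple of inclusive (lo, hi) ranges per tile, in row-major tile order.

    domain:  sequence of inclusive (lo, hi) bounds, one per dimension
    extents: tile extent for each dimension
    """
    per_dim = []
    for (lo, hi), extent in zip(domain, extents):
        starts = range(lo, hi + 1, extent)
        # The last tile is clipped to the domain's upper bound.
        per_dim.append([(s, min(s + extent - 1, hi)) for s in starts])
    yield from product(*per_dim)
```

Each yielded tuple can be turned into one slice per dimension, so only a single tile's worth of data needs to be in memory at a time.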

High level VFS IO interface

FileIO exists and works well for binary writes / reads / appends. We still need to implement reading and writing text data, and to add a tiledb.open() function that returns a FileIO object wrapping a VFS instance.

FileIO writes

Implement an open() method for Read / Write IO objects. Writes are not seekable.

setup.py --tiledb arg not handled correctly

E.g. setup.py build_ext --tiledb=/Users/tyler/tiledb/TileDB/dist/lib gives this error:

RuntimeError: Could not find given --tiledb library path(s): /Users/tyler/tiledb/TileDB/dist/lib/lib/libtiledb.dylib

But then setup.py build_ext --tiledb=/Users/tyler/tiledb/TileDB/dist gives:

RuntimeError: Could not find given --tiledb library path(s): /Users/tyler/tiledb/TileDB/dist/libtiledb.dylib
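
One possible fix is to accept either the install prefix or its lib directory and probe both. The sketch below is not the project's setup.py: find_libtiledb and library_name are hypothetical helpers, and the library file names are assumed to be libtiledb.dylib on macOS and libtiledb.so elsewhere.

```python
import os
import sys


def library_name(platform=sys.platform):
    """Shared-library file name to look for (assumed names, see above)."""
    return "libtiledb.dylib" if platform == "darwin" else "libtiledb.so"


def find_libtiledb(path, platform=sys.platform):
    """Return the library path, given either the install prefix or its lib directory."""
    name = library_name(platform)
    candidates = [
        os.path.join(path, name),           # path already points at .../lib
        os.path.join(path, "lib", name),    # path is the install prefix
        os.path.join(path, "lib64", name),  # some Linux layouts
    ]
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    raise RuntimeError(
        "Could not find given --tiledb library path(s): " + ", ".join(candidates)
    )
```

Listing every probed location in the error message also makes it easier to see why a given --tiledb value was rejected.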

Internal objects are not freed on all error paths

We should wrap internal TileDB objects e.g. tiledb_query_t and tiledb_array_t in simple internal python classes that just implement a __dealloc__ method to call the appropriate C API free call. This will ensure that those resources are properly freed on error paths.

Clean up during/handle KeyboardInterrupts

With the upcoming ability to cancel TileDB queries, a ^C will result in cancelling the queries, but losing the error message. We should do something like wrap all query submit calls in a try/catch, e.g.:


        rc = TILEDB_OK
        try:
            with nogil:
                rc = tiledb_query_submit(ctx_ptr, query_ptr)
        except KeyboardInterrupt:
            check_error(ctx, rc)
        finally:
            PyMem_Free(attr_names_ptr)
            PyMem_Free(buffers_ptr)
            PyMem_Free(buffer_sizes_ptr)
            tiledb_query_free(ctx_ptr, &query_ptr)

Upgrade Cython dependency

Pip install failed because the __eq__ methods don't seem to be implemented the way cython expects. I see that this is a new feature starting with Cython 0.27 (I had 0.26), so the Cython version dependency in setup.py should be incremented.

running build_ext
cythoning tiledb/libtiledb.pyx to tiledb/libtiledb.cpp

Error compiling Cython file:
------------------------------------------------------------
...

    def __len__(self):
        """Returns the number of parameters (keys) held by the Config object"""
        return sum(1 for _ in self)

    def __eq__(self, object config):
   ^
------------------------------------------------------------

tiledb/libtiledb.pyx:313:4: Special method __eq__ must be implemented via __richcmp__

Error compiling Cython file:
------------------------------------------------------------
...
            tiledb_attribute_free(&attr_ptr)
            _raise_ctx_err(ctx.ptr, rc)
        self.ctx = ctx
        self.ptr = attr_ptr

    def __eq__(self, other):
   ^
------------------------------------------------------------

tiledb/libtiledb.pyx:865:4: Special method __eq__ must be implemented via __richcmp__

Error compiling Cython file:
------------------------------------------------------------
...
            .format(self.name, self.domain, self.tile, self.dtype)

    def __len__(self):
        return self.size

    def __eq__(self, other):
   ^
------------------------------------------------------------

tiledb/libtiledb.pyx:1069:4: Special method __eq__ must be implemented via __richcmp__

Error compiling Cython file:
------------------------------------------------------------
...

    def __iter__(self):
        """Returns a generator object that iterates over the domain's dimension objects"""
        return (self.dim(i) for i in range(self.ndim))

    def __eq__(self, other):
   ^
------------------------------------------------------------

tiledb/libtiledb.pyx:1276:4: Special method __eq__ must be implemented via __richcmp__

Error compiling Cython file:
------------------------------------------------------------
...
            tiledb_kv_schema_free(&schema_ptr)
            check_error(ctx, rc)
        self.ctx = ctx
        self.ptr = schema_ptr

    def __eq__(self, other):
   ^
------------------------------------------------------------

tiledb/libtiledb.pyx:1477:4: Special method __eq__ must be implemented via __richcmp__

Error compiling Cython file:
------------------------------------------------------------
...
            tiledb_array_schema_free(&schema_ptr)
            _raise_ctx_err(ctx.ptr, rc)
        self.ctx = ctx
        self.ptr = schema_ptr

    def __eq__(self, other):
   ^
------------------------------------------------------------

tiledb/libtiledb.pyx:2165:4: Special method __eq__ must be implemented via __richcmp__
building 'tiledb.libtiledb' extension
creating build
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/tiledb
gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/nezar/local/lib/python3/tiledb-py/lib/include -I/home/nezar/miniconda3/envs/py36/lib/python3.6/site-packages/numpy/core/include -I/home/nezar/miniconda3/envs/py36/include/python3.6m -c tiledb/libtiledb.cpp -o build/temp.linux-x86_64-3.6/tiledb/libtiledb.o -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
tiledb/libtiledb.cpp:1:2: error: #error Do not use this file, it is the result of a failed Cython compilation.
 #error Do not use this file, it is the result of a failed Cython compilation.
  ^
error: command 'gcc' failed with exit status 1

SparseArray read order affects correctness

This is a weird one: it seems that in some cases, after opening a SparseArray in read-only mode, performing a query of an empty subset of the domains will cause all future queries (even of non-empty subsets) to appear empty. Performing queries of empty domain subsets does not affect correctness if they are preceded by a query of a non-empty domain subset.

My current workaround is just to perform a dummy query of the entire array (_ = arr[:, :]) every time I open one, and this seems to fix the issue, but it shouldn't be necessary.

I'm filing the bug here because I noticed it using the TileDB Python interface, but it's possible it affects other language bindings too; I haven't checked.

I've put a full set of data files and example code to reproduce here: http://mitra.stanford.edu/kundaje/cprobert/tiledb_bug/

In particular you might want to check out the html version of the notebook (since it's easier to view without downloading): http://mitra.stanford.edu/kundaje/cprobert/tiledb_bug/tiledb_bug.html

I've also copied the contents of the notebook and output below. These results were generated with the TileDB-Py v0.2.1 pre-release, but I've been able to reproduce with the v0.2.0 release too. I'm using Python 3.6.6, and NumPy 1.15.0, but have also seen this bug in a Python 3.5 environment and with different NumPy versions. The platform is an Intel Xeon server running Ubuntu 16.04.5 LTS.

import tempfile
import os

import numpy as np
import pandas as ps

import tiledb


n_idxs = np.load('x_coords.npy')
m_idxs = np.load('y_coords.npy')
values = np.load('values.npy')

# Check the (x, y) coordinate pairs are unique
df = ps.DataFrame({'x': n_idxs, 'y': m_idxs})
df = ps.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns).drop_duplicates()
assert df.shape[0] == n_idxs.shape[0]

num_entries = n_idxs.shape[0]
values_sum = values.sum()


ctx = tiledb.Ctx()

n = 249250621
m = 400

n_tile_extent = 50000

d1 = tiledb.Dim(ctx, "ndom", domain=(0, n - 1), tile=n_tile_extent, dtype="uint32")
d2 = tiledb.Dim(ctx, "mdom", domain=(0, m - 1), tile=m, dtype="uint32")

domain = tiledb.Domain(ctx, d1, d2)

v = tiledb.Attr(ctx, "v", compressor=("lz4", -1), dtype="uint8")

schema = tiledb.ArraySchema(
    ctx,
    domain=domain,
    attrs=(v,),
    capacity=10000,
    cell_order="row-major",
    tile_order="row-major",
    sparse=True,
)

with tempfile.TemporaryDirectory() as tdir:

    path = os.path.join(tdir, "arr.tiledb")

    tiledb.SparseArray.create(path, schema)

    with tiledb.SparseArray(ctx, path, mode="w") as A:
        A[n_idxs, m_idxs] = values
        
    print('\n>> 1: Reading empty query first blocks subsequent queries of non-empty cells:\n')
    
    with tiledb.SparseArray(ctx, path, mode="r") as A:
        n_ent = A[0, 0]['v'].shape[0]
        print('reading empty cell: {} entries (expected {})'.format(n_ent, 0))
        
        n_ent = A[:, :]['v'].shape[0]
        print('reading whole array: {} entries (expected {})'.format(n_ent, num_entries))
        
        n_ent = A[0, 0]['v'].shape[0]
        print('reading empty cell: {} entries (expected {})'.format(n_ent, 0))
        
        n_ent = A[:, :]['v'].shape[0]
        print('reading whole array: {} entries (expected {})'.format(n_ent, num_entries))
    
    
    print('\n>> 2: Reading non-empty query first allows subsequent queries of non-empty cells:\n')
    
    with tiledb.SparseArray(ctx, path, mode="r") as A:
        n_ent = A[:, :]['v'].shape[0]
        print('reading whole array: {} entries (expected {})'.format(n_ent, num_entries))
        
        n_ent = A[0, 0]['v'].shape[0]
        print('reading empty cell: {} entries (expected {})'.format(n_ent, 0))
        
        n_ent = A[:, :]['v'].shape[0]
        print('reading whole array: {} entries (expected {})'.format(n_ent, num_entries))
        
        n_ent = A[0, 0]['v'].shape[0]
        print('reading empty cell: {} entries (expected {})'.format(n_ent, 0))

outputs:

>> 1: Reading empty query first blocks subsequent queries of non-empty cells:

reading empty cell: 0 entries (expected 0)
reading whole array: 0 entries (expected 54696022)
reading empty cell: 0 entries (expected 0)
reading whole array: 0 entries (expected 54696022)

>> 2: Reading non-empty query first allows subsequent queries of non-empty cells:

reading whole array: 54696022 entries (expected 54696022)
reading empty cell: 0 entries (expected 0)
reading whole array: 54696022 entries (expected 54696022)
reading empty cell: 0 entries (expected 0)

Async query support

Integrate with Python 3 futures / async support. We will need to figure out an interface that is also usable on Python 2.
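
One possible shape for such an interface, as a rough sketch only: run the blocking read on a thread pool and hand back a `concurrent.futures.Future`. That module is in the Python 3 standard library and is available on Python 2 through the `futures` backport on PyPI. The names `read_async` and `blocking_read` are hypothetical and not part of TileDB-Py.

```python
from concurrent.futures import ThreadPoolExecutor


def read_async(executor, blocking_read, *args, **kwargs):
    """Submit a blocking read to a thread pool and return a Future."""
    return executor.submit(blocking_read, *args, **kwargs)


def blocking_read(start, stop):
    # Hypothetical stand-in for a synchronous array read.
    return list(range(start, stop))


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        future = read_async(pool, blocking_read, 0, 3)
        # result() blocks until the read has finished.
        print(future.result())
```

Whether reads actually overlap this way depends on whether the underlying native call releases the GIL, which this sketch does not address.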

Handle multidimensional dense empty results

The following dense test cases currently fail. We need to handle this edge case, where no query should be issued to the TileDB library because a subarray cannot be constructed.

Example:

In [48]: A[3:2]
Out[48]: array([], dtype=int64)
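
A minimal sketch of how the empty case might be detected before any query is issued, assuming a zero-based domain described by a `shape` tuple. The helper `is_empty_selection` is hypothetical and not part of TileDB-Py; it only relies on the built-in `slice.indices` method.

```python
def is_empty_selection(selection, shape):
    """Return True if any slice in `selection` selects zero cells.

    Assumes a zero-based domain whose per-dimension lengths are `shape`.
    Dimensions missing from `selection` are treated as fully selected.
    """
    if not isinstance(selection, tuple):
        selection = (selection,)
    for sel, dim_len in zip(selection, shape):
        if isinstance(sel, slice):
            # Normalize negative and out-of-range bounds for this dimension.
            start, stop, step = sel.indices(dim_len)
            if len(range(start, stop, step)) == 0:
                return True
    return False
```

A caller could check this first and return an empty array of the attribute's dtype instead of building a subarray.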

`wheel` prerequisite is not installed via `pip install`

When installing via pip install tiledb, if the wheel package is not already present on the user's system, the native TileDB library will be built but not properly copied into the site-packages location. A runtime error about not finding libtiledb.so will result.

A temporary fix is to run pip install wheel && pip install tiledb. The full fix, I believe, is to add wheel to install_requires in setup.py.

Variable Length Type support

Variable-length (varlen) TileDB types are not currently supported. Output should be a NumPy object array, with each cell being a string or a variable-length NumPy array.

better pickling support

Need to add the directive #cython: auto_pickle=False

In general we should be able to pickle most object representations by reloading the URI of that resource.
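
The reload-by-URI idea could look roughly like the sketch below. `ArrayHandle` is a hypothetical pure-Python stand-in, not a TileDB-Py class; the only real mechanism used is Python's `__reduce__` protocol.

```python
class ArrayHandle(object):
    """Hypothetical stand-in for an object backed by a resource at a URI."""

    def __init__(self, uri):
        self.uri = uri
        # Stand-in for native state that is rebuilt rather than pickled.
        self._native = object()

    def __reduce__(self):
        # Serialize only the URI; unpickling calls ArrayHandle(uri) again,
        # which re-creates the native state from the resource.
        return (ArrayHandle, (self.uri,))
```

With this in place, `pickle.loads(pickle.dumps(handle))` would yield a new `ArrayHandle` with the same `uri` and a freshly built native state.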

cc. @anh

Read & Write mode ('rw')

When prototyping, it is annoying to explicitly open the array for reading and writing. Add an 'rw' mode that opens and closes the array for every read, write, or metadata access.
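
As a sketch of the behaviour described, under stated assumptions: the wrapper below opens the array in the right mode for each access and closes it again. `ReadWriteArray` and its `opener` parameter are hypothetical. `opener(uri, mode)` is assumed to return a context manager yielding an indexable array object; a callable such as `tiledb.open` might fit, though that has not been checked against the version discussed here.

```python
class ReadWriteArray(object):
    """Hypothetical wrapper that opens and closes the array per access."""

    def __init__(self, uri, opener):
        self.uri = uri
        # opener(uri, mode) must return a context manager that yields
        # an object supporting item get/set.
        self._opener = opener

    def __getitem__(self, key):
        # Open for reading, read, then close on leaving the block.
        with self._opener(self.uri, "r") as arr:
            return arr[key]

    def __setitem__(self, key, value):
        # Open for writing, write, then close on leaving the block.
        with self._opener(self.uri, "w") as arr:
            arr[key] = value
```

Opening per access trades performance for convenience, so this seems better suited to prototyping than to bulk reads or writes.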
