ndindex's Introduction

ndindex

ndindex logo

A Python library for manipulating indices of ndarrays.

The documentation for ndindex can be found at https://quansight-labs.github.io/ndindex/

ndindex is a library that allows representing and manipulating objects that can be valid indices to numpy arrays, i.e., slices, integers, ellipses, None, integer and boolean arrays, and tuples thereof. The goals of the library are

  • Provide a uniform API to manipulate these objects. Unlike the standard index objects themselves like slice, int, and tuple, which do not share any methods in common related to being indices, ndindex classes can all be manipulated uniformly. For example, idx.args always gives the arguments used to construct idx.

  • Give 100% correct semantics as defined by numpy's ndarray. This means that ndindex will not make a transformation on an index object unless it is correct for all possible input array shapes. The only exception to this rule is that ndindex assumes that any given index will not raise IndexError (for instance, from an out-of-bounds integer index or from too few dimensions in the array). For those operations where the array shape is known, there is a reduce() method to reduce an index to a simpler index that is equivalent for the given shape.

  • Enable useful transformation and manipulation functions on index objects.

Examples

Canonicalize a slice (over a given shape, or independent of array shape)

>>> from ndindex import *
>>> Slice(-2, 10, 3).reduce()
Slice(-2, 10, 2)
>>> Slice(-2, 10, 3).reduce(5)
Slice(3, 4, 1)

Compute the maximum length of a sliced axis

>>> import numpy as np
>>> len(Slice(2, 10, 3))
3
>>> len(np.arange(10)[2:10:3])
3

Compute the shape of an array of shape (10, 20) indexed by [0, 0:10]

>>> Tuple(0, slice(0, 10)).newshape((10, 20))
(10,)
>>> np.ones((10, 20))[0, 0:10].shape
(10,)

Check if an indexed array would be empty

>>> Tuple(0, ..., Slice(10, 20)).isempty((3, 4, 5))
True
>>> np.ones((3, 4, 5))[0,...,10:20]
array([], shape=(4, 0), dtype=float64)

See the documentation for full details on what ndindex can do.

License

MIT License

Acknowledgments

ndindex development is supported by Quansight Labs and is sponsored in part by the D. E. Shaw group. The D. E. Shaw group collaborates with Quansight on numerous open source projects, including Numba, Dask and Project Jupyter.

https://labs.quansight.org/ https://www.deshaw.com

ndindex's People

Contributors

asmeurer, ilanschnell, keszybz, melissawm, rgommers, ruancomelli, scopatz, synapticarbors, telamonian, zac-hd

ndindex's Issues

Old links are broken

Apparently GitHub doesn't redirect GitHub pages, so all the old links from quansight.github.io/ndindex are broken. I thought they were redirecting, but they aren't now, so I'm not sure what's up with that.

This page https://shoehornwithteeth.com/ramblings/2016/12/redirecting-github-pages-after-renaming-a-repository/ suggests we can use a Jekyll plugin to fix the old redirects. The downside is this means we have to recreate the repository, which will break redirects on GitHub itself.

set-like operations

It would also be useful to have various set-like operations, treating the elements indexed by an index as a set

  • in meaning subset
  • + and | meaning union
  • & meaning intersection
  • - meaning set difference (idx1 - idx2 selects the elements indexed by idx1 but not by idx2)

For example, idx1 in idx2 would return True if all the elements indexed by idx1 are also indexed by idx2. idx1 + idx2 would give an index representing the elements indexed by either idx1 or idx2.
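
For concreteness, the proposed semantics can be checked by materializing the positions an index selects on a concrete array. This is just a sketch of the semantics in plain numpy, not proposed ndindex API (selected_positions is a hypothetical helper):

import numpy as np

def selected_positions(idx, shape):
    # The set of flat positions selected by idx on an array of the given shape
    positions = np.arange(int(np.prod(shape))).reshape(shape)
    return set(np.asarray(positions[idx]).flatten().tolist())

shape = (10,)
idx1, idx2 = slice(2, 5), slice(0, 8)
# idx1 in idx2 (subset): everything indexed by idx1 is also indexed by idx2
assert selected_positions(idx1, shape) <= selected_positions(idx2, shape)
# idx1 & idx2 (intersection): the positions indexed by both
assert selected_positions(idx1, shape) & selected_positions(idx2, shape) == {2, 3, 4}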

There will obviously be lots of cases where these operations have to raise exceptions because the result can't be computed. I don't know if the canonical form after reduce(shape) will always avoid that, or if we would also need method variants that take a shape. There will also be cases where the result can't be represented except by an array index.

Another issue is that for integer array indices, the same element can be indexed multiple times, so it doesn't completely make sense to treat it as a set.

iter() should iterate single element indices

For example, list(Slice(0, 3)) would return [0, 1, 2]. list(Tuple(Slice(0, 2), Slice(0, 2))) would return [(0, 0), (0, 1), (1, 0), (1, 1)]. It would raise ValueError on a Tuple with an ellipsis.

An iteration would be infinite if the length of a slice is unbounded, but a shape-reduced index should always give a finite iteration.
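
A minimal sketch of the intended iteration, assuming the slices have already been reduced so that start, stop, and step are concrete integers (iter_slice and iter_tuple are hypothetical helpers, not ndindex API):

from itertools import product

def iter_slice(s):
    # Yield the integer indices selected by slice(start, stop, step)
    yield from range(s.start, s.stop, s.step)

def iter_tuple(*slices):
    # Yield single-element tuple indices in C (row-major) order
    yield from product(*(iter_slice(s) for s in slices))

print(list(iter_slice(slice(0, 3, 1))))                  # [0, 1, 2]
print(list(iter_tuple(slice(0, 2, 1), slice(0, 2, 1))))  # [(0, 0), (0, 1), (1, 0), (1, 1)]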

isindex function

It would be useful to have a function that checks if an object is an index type. It should have a flag to include or not include NDIndex types.
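
Something like the following sketch (isindex and its flag name are hypothetical, the NDIndex import path may differ, and list indices and other sequence cases are ignored here):

import numbers
import numpy as np
from ndindex import NDIndex  # base class; exact import path is an assumption

def isindex(obj, include_ndindex=True):
    # Check whether obj is a valid index type for a numpy array
    if isinstance(obj, NDIndex):
        return include_ndindex
    if isinstance(obj, (numbers.Integral, slice, type(...), type(None))):
        return True
    if isinstance(obj, np.ndarray):
        return np.issubdtype(obj.dtype, np.integer) or obj.dtype == np.bool_
    if isinstance(obj, tuple):
        return all(isindex(i, include_ndindex) for i in obj)
    return False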

v1.6: test_integerarray_reduce_hypothesis falsifying example

Version 1.6 fails the test suite, despite PRs #133 and #147 being applied:

[   86s] ============================= test session starts ==============================
[   86s] platform linux -- Python 3.8.16, pytest-7.2.0, pluggy-1.0.0 -- /usr/bin/python3.8
[   86s] cachedir: .pytest_cache
[   86s] hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/home/abuild/rpmbuild/BUILD/ndindex-1.6/.hypothesis/examples')
[   86s] rootdir: /home/abuild/rpmbuild/BUILD/ndindex-1.6
[   86s] plugins: hypothesis-6.61.2
[   87s] collecting ... collected 105 items
...
[  240s] _____________________ test_integerarray_reduce_hypothesis ______________________
[  240s] 
[  240s] a = array([], shape=(1, 0), dtype=int64), idx = array([2, 0])
[  240s] raw_func = <function <lambda> at 0x7f2b91ac30d0>
[  240s] ndindex_func = <function test_integerarray_reduce_hypothesis.<locals>.<lambda> at 0x7f2b91b063a0>
[  240s] same_exception = True, assert_equal = <function assert_equal at 0x7f2b91ac3040>
[  240s] 
[  240s]     def check_same(a, idx, raw_func=lambda a, idx: a[idx],
[  240s]                    ndindex_func=lambda a, index: a[index.raw],
[  240s]                    same_exception=True, assert_equal=assert_equal):
[  240s]         """
[  240s]         Check that a raw index idx produces the same result on an array a before
[  240s]         and after being transformed by ndindex.
[  240s]     
[  240s]         Tests that raw_func(a, idx) == ndindex_func(a, ndindex(idx)) or that they
[  240s]         raise the same exception. If same_exception=False, it will still check
[  240s]         that they both raise an exception, but will not require the exception type
[  240s]         and message to be the same.
[  240s]     
[  240s]         By default, raw_func(a, idx) is a[idx] and ndindex_func(a, index) is
[  240s]         a[index.raw].
[  240s]     
[  240s]         The assert_equal argument changes the function used to test equality. By
[  240s]         default it is the custom assert_equal() function in this file that extends
[  240s]         numpy.testing.assert_equal. If the func functions return something other
[  240s]         than arrays, assert_equal should be set to something else, like
[  240s]     
[  240s]             def assert_equal(x, y):
[  240s]                 assert x == y
[  240s]     
[  240s]         """
[  240s]         exception = None
[  240s]         try:
[  240s]             # Handle list indices that NumPy treats as tuple indices with a
[  240s]             # deprecation warning. We want to test against the post-deprecation
[  240s]             # behavior.
[  240s]             e_inner = None
[  240s]             try:
[  240s]                 try:
[  240s]                     a_raw = raw_func(a, idx)
[  240s]                 except Warning as w:
[  240s]                     # In NumPy < 1.23, this is a FutureWarning. In 1.23 the
[  240s]                     # deprecation was removed and lists are always interpreted as
[  240s]                     # array indices.
[  240s]                     if ("Using a non-tuple sequence for multidimensional indexing is deprecated" in w.args[0]): # pragma: no cover
[  240s]                         idx = array(idx)
[  240s]                         a_raw = raw_func(a, idx)
[  240s]                     elif "Out of bound index found. This was previously ignored when the indexing result contained no elements. In the future the index error will be raised. This error occurs either due to an empty slice, or if an array has zero elements even before indexing." in w.args[0]:
[  240s]                         same_exception = False
[  240s]                         raise IndexError
[  240s]                     else: # pragma: no cover
[  240s]                         fail(f"Unexpected warning raised: {w}")
[  240s]             except Exception:
[  240s]                 _, e_inner, _ = sys.exc_info()
[  240s]             if e_inner:
[  240s]                 raise e_inner
[  240s]         except Exception as e:
[  240s]             exception = e
[  240s]     
[  240s]         try:
[  240s]             index = ndindex(idx)
[  240s] >           a_ndindex = ndindex_func(a, index)
[  240s] 
[  240s] ndindex/tests/helpers.py:238: 
[  240s] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[  240s] 
[  240s] a = array([], shape=(1, 0), dtype=int64), x = IntegerArray([2, 0])
[  240s] 
[  240s] >   check_same(a, index.raw, ndindex_func=lambda a, x: a[x.reduce(shape).raw])
[  240s] 
[  240s] ndindex/tests/test_integerarray.py:65: 
[  240s] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[  240s] 
[  240s] self = IntegerArray([2, 0]), shape = (1, 0), axis = 0
[  240s] 
[  240s]     def reduce(self, shape=None, axis=0):
[  240s]         """
[  240s]         Reduce an `IntegerArray` index on an array of shape `shape`.
[  240s]     
[  240s]         The result will either be `IndexError` if the index is invalid for the
[  240s]         given shape, an `IntegerArray` index where the values are all
[  240s]         nonnegative, or, if `self` is a scalar array index (`self.shape ==
[  240s]         ()`), an `Integer` whose value is nonnegative.
[  240s]     
[  240s]         >>> from ndindex import IntegerArray
[  240s]         >>> idx = IntegerArray([-5, 2])
[  240s]         >>> idx.reduce((3,))
[  240s]         Traceback (most recent call last):
[  240s]         ...
[  240s]         IndexError: index -5 is out of bounds for axis 0 with size 3
[  240s]         >>> idx.reduce((9,))
[  240s]         IntegerArray([4, 2])
[  240s]     
[  240s]         See Also
[  240s]         ========
[  240s]     
[  240s]         .NDIndex.reduce
[  240s]         .Tuple.reduce
[  240s]         .Slice.reduce
[  240s]         .ellipsis.reduce
[  240s]         .Newaxis.reduce
[  240s]         .Integer.reduce
[  240s]         .BooleanArray.reduce
[  240s]     
[  240s]         """
[  240s]         if self.shape == ():
[  240s]             return Integer(self.array).reduce(shape, axis=axis)
[  240s]     
[  240s]         if shape is None:
[  240s]             return self
[  240s]     
[  240s]         shape = asshape(shape, axis=axis)
[  240s]     
[  240s]         size = shape[axis]
[  240s]         new_array = self.array.copy()
[  240s]         out_of_bounds = (new_array >= size) | ((-size > new_array) & (new_array < 0))
[  240s]         if out_of_bounds.any():
[  240s] >           raise IndexError(f"index {new_array[out_of_bounds].flat[0]} is out of bounds for axis {axis} with size {size}")
[  240s] E           IndexError: index 2 is out of bounds for axis 0 with size 1
[  240s] 
[  240s] ndindex/integerarray.py:96: IndexError
[  240s] 
[  240s] During handling of the above exception, another exception occurred:
[  240s] 
[  240s]     @example(array([2, 0]), (1, 0))
[  240s] >   @example(array(0), 1)
[  240s] 
[  240s] ndindex/tests/test_integerarray.py:54: 
[  240s] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[  240s] /usr/lib/python3.8/site-packages/hypothesis/core.py:979: in _raise_to_user
[  240s]     raise the_error_hypothesis_found
[  240s] ndindex/tests/test_integerarray.py:65: in test_integerarray_reduce_hypothesis
[  240s]     check_same(a, index.raw, ndindex_func=lambda a, x: a[x.reduce(shape).raw])
[  240s] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[  240s] 
[  240s] a = array([], shape=(1, 0), dtype=int64), idx = array([2, 0])
[  240s] raw_func = <function <lambda> at 0x7f2b91ac30d0>
[  240s] ndindex_func = <function test_integerarray_reduce_hypothesis.<locals>.<lambda> at 0x7f2b91b063a0>
[  240s] same_exception = True, assert_equal = <function assert_equal at 0x7f2b91ac3040>
[  240s] 
[  240s]     def check_same(a, idx, raw_func=lambda a, idx: a[idx],
[  240s]                    ndindex_func=lambda a, index: a[index.raw],
[  240s]                    same_exception=True, assert_equal=assert_equal):
[  240s]         """
[  240s]         Check that a raw index idx produces the same result on an array a before
[  240s]         and after being transformed by ndindex.
[  240s]     
[  240s]         Tests that raw_func(a, idx) == ndindex_func(a, ndindex(idx)) or that they
[  240s]         raise the same exception. If same_exception=False, it will still check
[  240s]         that they both raise an exception, but will not require the exception type
[  240s]         and message to be the same.
[  240s]     
[  240s]         By default, raw_func(a, idx) is a[idx] and ndindex_func(a, index) is
[  240s]         a[index.raw].
[  240s]     
[  240s]         The assert_equal argument changes the function used to test equality. By
[  240s]         default it is the custom assert_equal() function in this file that extends
[  240s]         numpy.testing.assert_equal. If the func functions return something other
[  240s]         than arrays, assert_equal should be set to something else, like
[  240s]     
[  240s]             def assert_equal(x, y):
[  240s]                 assert x == y
[  240s]     
[  240s]         """
[  240s]         exception = None
[  240s]         try:
[  240s]             # Handle list indices that NumPy treats as tuple indices with a
[  240s]             # deprecation warning. We want to test against the post-deprecation
[  240s]             # behavior.
[  240s]             e_inner = None
[  240s]             try:
[  240s]                 try:
[  240s]                     a_raw = raw_func(a, idx)
[  240s]                 except Warning as w:
[  240s]                     # In NumPy < 1.23, this is a FutureWarning. In 1.23 the
[  240s]                     # deprecation was removed and lists are always interpreted as
[  240s]                     # array indices.
[  240s]                     if ("Using a non-tuple sequence for multidimensional indexing is deprecated" in w.args[0]): # pragma: no cover
[  240s]                         idx = array(idx)
[  240s]                         a_raw = raw_func(a, idx)
[  240s]                     elif "Out of bound index found. This was previously ignored when the indexing result contained no elements. In the future the index error will be raised. This error occurs either due to an empty slice, or if an array has zero elements even before indexing." in w.args[0]:
[  240s]                         same_exception = False
[  240s]                         raise IndexError
[  240s]                     else: # pragma: no cover
[  240s]                         fail(f"Unexpected warning raised: {w}")
[  240s]             except Exception:
[  240s]                 _, e_inner, _ = sys.exc_info()
[  240s]             if e_inner:
[  240s]                 raise e_inner
[  240s]         except Exception as e:
[  240s]             exception = e
[  240s]     
[  240s]         try:
[  240s]             index = ndindex(idx)
[  240s]             a_ndindex = ndindex_func(a, index)
[  240s]         except Exception as e:
[  240s]             if not exception:
[  240s] >               fail(f"Raw form does not raise but ndindex form does ({e!r}): {index})") # pragma: no cover
[  240s] E               Failed: Raw form does not raise but ndindex form does (IndexError('index 2 is out of bounds for axis 0 with size 1')): IntegerArray([2 0]))
[  240s] E               Falsifying explicit example: test_integerarray_reduce_hypothesis(
[  240s] E                   idx=array([2, 0]),
[  240s] E                   shape=(1, 0),
[  240s] E               )
[  240s] 
[  240s] ndindex/tests/helpers.py:241: Failed

Add is_view method

Need a method that returns True iff the index would produce a view. Should be quite simple to implement.
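
A minimal sketch, using the numpy rule that basic indices (integers, slices, ellipsis, newaxis, and tuples thereof) produce views, while integer and boolean array indices produce copies (is_view is the proposed, not yet existing, method):

from ndindex import ndindex, Tuple, IntegerArray, BooleanArray

def is_view(idx):
    # True iff a[idx] would be a view of a (i.e., idx is a basic index)
    idx = ndindex(idx)
    if isinstance(idx, Tuple):
        return not any(isinstance(i, (IntegerArray, BooleanArray)) for i in idx.args)
    return not isinstance(idx, (IntegerArray, BooleanArray))

print(is_view((0, slice(None))))  # True: basic indexing gives a view
print(is_view([0, 1]))            # False: fancy indexing gives a copy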

Documentation organization

Right now the bulk of the documentation is in the docstrings.

Presently, the docs have an "api.html" page that lists all the docstrings for every class in the library https://quansight.github.io/ndindex/api.html.

I also want to add some narrative documentation (i.e., in Sphinx only, not docstrings), for instance, describing different intricacies of indexing. For example, I want to write a document that explains how slice indexing works.

I am wondering what the best way to organize the documentation would be. Should we keep the API documentation separate from the above discussions? Or would it make more sense to have separate pages for each index type, which has the discussion in one section and the API documentation in another? Conversely, if we decide to keep the API docs separate, is it OK to keep it all on one page or is it better to split into separate pages for each class?

@melissawm can you give your thoughts on this?

Rename ndindex.py

It's bad to have the same submodule name as a name in the submodule. We should rename ndindex.py to something like common.py. ellipsis.py also has this problem.

Method asflat

A method idx.asflat(shape) would return an index j such that a[idx].flatten() == a.flatten()[j].

More generally, we could have a method that reindexes into a reshaped array.

One thing I'm not sure about is how generally this can work. How generally is a flattened index of a simple index still a simple index (i.e., we would not require IntegerIndex, which is not yet implemented)?
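
For a concrete sense of the defining identity, here it is checked in plain numpy for one simple case (asflat itself does not exist yet; j is worked out by hand):

import numpy as np

a = np.arange(12).reshape(3, 4)
idx = (1, slice(0, 3))  # a[1, 0:3] selects row 1, columns 0-2
j = slice(4, 7)         # row 1 starts at flat position 1*4 == 4
assert (a[idx].flatten() == a.flatten()[j]).all()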

Distinction between rank-0 arrays and scalars

There are some instances where reduce currently transforms one index into another where they don't give the same thing on scalars or rank-0 arrays. For example

>>> np.int64(0)[...]
array(0)
>>> np.int64(0)[()]
0
>>> np.array(0)[...]
array(0)
>>> np.array(0)[()]
0

Currently ellipsis().reduce() gives Tuple().

>>> np.array([1, 2, 3])[0,]
1
>>> np.array([1, 2, 3])[0,...]
array(1)

Currently the code assumes that tuples always implicitly end in an ellipsis (Tuple.reduce removes any trailing ellipsis).

I don't know if we should care about this distinction.

A way to serialize/deserialize ndindex

Hi all. This seems like a really neat project, and one I'll probably get a lot of use out of. Question: would a way to deserialize/serialize an ndindex be something that's within the scope of this project?

Here's my use case -> jupyterlab/jupyterlab-hdf5#4. Short version:

  • A user opens an hdf5 dataset in their browser, then specifies which 1D or 2D slice of a hyperslab to view using numpy-style slice notation
  • The jupyterlab-hdf5 frontend ships said slice to its backend in the form of a string
  • On the backend, I rehydrate/deserialize/whatever the slice string and then hand it to h5py, which also uses numpy-style slices to specify hyperslabs

I'd be more than willing to pitch in and help with this in the form of code/PRs, but I'll probably need some help figuring out how to handle deserialization.
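
For illustration, here is one way the backend could rehydrate a numpy-style slice string. This is just a sketch of the idea (parse_index_string is hypothetical, and negative steps, ellipses, etc. are ignored), not whatever format ndindex would actually adopt:

def parse_index_string(s):
    # Parse e.g. "1:10:2, 3" into (slice(1, 10, 2), 3)
    def parse_part(part):
        part = part.strip()
        if ':' not in part:
            return int(part)
        fields = [int(f) if f.strip() else None for f in part.split(':')]
        return slice(*fields)
    return tuple(parse_part(p) for p in s.split(','))

print(parse_index_string("1:10:2, 3"))  # (slice(1, 10, 2), 3)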

Slice.reduce should give a completely canonical result

What that means is that any two slices that are always identical should always reduce to the same slice. The main things I know are missing are

  • Empty slices (e.g., Slice(4, 3).reduce() currently returns the slice unchanged)
  • Unnecessary steps, like Slice(1, 2, 10).

We can test this by keeping track of which slices are equal as mappings in an exhaustive test.
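
"Always identical" here means identical on arrays of every shape. For example, the empty slices 4:3 and 0:0 select the same (empty) elements for any array length, so they should reduce to the same Slice:

import numpy as np

for n in range(10):
    a = np.arange(n)
    assert len(a[4:3]) == len(a[0:0]) == 0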

hash(ndindex_obj) != hash(raw_obj)

ndindex objects are equal to the equivalent raw objects. But they define a separate hash, which can cause issues if the raw object is also hashable:

>>> hash(Tuple(1, ...))
3933283291967008864
>>> hash((1, ...))
3712794302832630381
>>> Tuple(1, ...) == (1, ...)
True

In general it would not be a good idea to use a raw object in a hashable context, because only a subset of raw indices are hashable (slices and arrays are not hashable), but it would still be a good idea to make the hashes match when they exist (see https://www.asmeurer.com/blog/posts/what-happens-when-you-mess-with-hashing-in-python/).
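
Following the approach in that post, a sketch of a fix on the base class (illustrative only, not necessarily how it should actually be implemented):

class NDIndex:
    def __hash__(self):
        try:
            # Delegate to the raw index so hash(Tuple(1, ...)) == hash((1, ...))
            return hash(self.raw)
        except TypeError:
            # Raw slices and arrays are unhashable, so there is no hash to match
            return hash(self.args)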

Numba support

It's not something I need, but it would be useful to support numba in ndindex.

Numba support means two things:

  1. (Optionally) @njit all relevant ndindex methods, so that they can run fast.
  2. Register the ndindex types with numba so that they themselves can be used inside of an @njitted function.

I expect both of these would be challenging, but I also imagine this could be very useful for someone who wants to use ndindex in contexts where performance matters.

Remove duplication in tests

I'm not sure how it should be properly fixed yet, but there is currently a lot of duplication in the test logic, especially for different iterations of the same test across different types.

More concise reprs for Slice

Currently Slice always prints all three arguments, even if only one or two are needed to create it:

>>> Slice(None)
Slice(None, None, None)
>>> Slice(0, 10)
Slice(0, 10, None)

These could be printed more concisely as Slice(None) and Slice(0, 10).

We should also fix the printing of slice in Tuples so that something like Tuple(slice(None), slice(0, 10)) prints as such instead of Tuple(slice(None, None, None), slice(0, 10, None)).

(note, printing slices using : notation is a separate issue. This would be useful, but is only valid inside of an index expression so it should be done in a separate method)
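
A sketch of the trailing-default logic, shown as a standalone function (the real change would go in Slice.__repr__):

def slice_repr(s):
    # Drop trailing arguments that are None, since Slice fills them in
    if s.step is not None:
        args = (s.start, s.stop, s.step)
    elif s.stop is not None:
        args = (s.start, s.stop)
    else:
        args = (s.start,)
    return f"Slice({', '.join(map(repr, args))})"

print(slice_repr(slice(None)))   # Slice(None)
print(slice_repr(slice(0, 10)))  # Slice(0, 10)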

Factor out common logic in API functions

A lot of API functions have common logic at the top of each implementation. For example, calling ndindex() on the input arguments, or type checking that is the same in every case.

We should refactor the common logic to the superclass, and have the methods call implementations in the subclasses. Something like

class NDIndex:
    def method(self, arg):
        # Common logic goes here

        return self._method(arg)

    def _method(self, arg):
        raise NotImplementedError

class Subclass(NDIndex):
    def _method(self, arg):
        # Subclass-specific implementation
        ...

The only downside of this is that it makes it impossible to have subclass specific docstrings. Presently reduce and expand are the only methods with subclass specific docstrings, and I don't think they have any common logic, so they can be left alone. For other methods, we could have some way to automatically replace the docstrings with docstrings on the implementation methods.

ndarray(x) == NDIndex(x) and NDIndex(x) == ndarray(x) should return a consistent result

Currently, testing the equality of NDIndex vs ndarray objects changes depending on the order of the objects:

>>> ndindex([0, 1]) == np.array([0, 1])
True
>>> np.array([0, 1]) == ndindex([0, 1])
array([False, False])

Ideally, the result should not change in this case, i.e.,

a = (ndindex([0, 1]) == np.array([0, 1]))
b = (np.array([0, 1]) == ndindex([0, 1]))
np.all(a == b) == True

Reduce array indices to slices

I'm not sure if it's a good idea, but we could make reduce give a slice for an array index if it is equal. It isn't the same for arrays in tuples, so some care is required there. This also would change the index from a fancy to non-fancy index (the latter creates a view in NumPy), so I'm also not sure if it's actually a good idea.
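
The view/copy distinction in question is easy to see directly in numpy:

import numpy as np

a = np.arange(10)
assert a[0:3].base is a                          # slice: a view into a
assert a[np.array([0, 1, 2])].base is not a      # array index: a copy
assert (a[0:3] == a[np.array([0, 1, 2])]).all()  # same elements either way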

Hypothesis not finding failing examples

At #76 (particularly the most recent commit, 2006911), I implemented the logic for as_subindex wrong. But the hypothesis test for as_subindex just isn't finding a failing example. I even ran it with --max-examples=1000000, which took 2 hours to run, and nothing.

I tried reducing the broad ndindices strategies to strategies that I knew would encompass the bug. I got nothing until I narrowed it down to just

shapes = tuples(integers(0, 10)).filter(
             # numpy gives errors with empty arrays with large shapes.
             # See https://github.com/numpy/numpy/issues/15753
             lambda shape: prod([i for i in shape if i]) < 1000)

_integer_arrays = arrays(intp, shapes, elements=integers(0, 10))

Tuples_of_slices = tuples(one_of(
    slices(),
), min_size=2).filter(_doesnt_raise)

Tuples_of_integerarrays = tuples(one_of(
    _integer_arrays,
), min_size=2, max_size=32).filter(_doesnt_raise)

@given(Tuples_of_slices, Tuples_of_integerarrays, shapes)
def test_as_subindex_hypothesis(idx1, idx2, shape):
    ...

(compare that to the default ndindices and shapes strategies https://github.com/asmeurer/ndindex/blob/2006911d291f5af0681b64bd9ba6b96eed5d5eec/ndindex/tests/helpers.py). And even then I still had to run with --max-examples=1000.

A shrunk example that fails is

    idx1=(slice(None, 1, None), slice(None, 1, None))
    idx2=(array([0]), array([0]))
    shape=(1, 1)

@Zac-HD do you have any suggestions on how I can improve the situation here? I understand that the search space for my strategies is pretty big, but I would expect to at least get something useful out of it after running for a few hours. By the way, if you want to run the above test, you may have to install NumPy from git (sorry).

Is there maybe some way to tell hypothesis that certain things are more important than others? When I look at the verbose output, it seems to be trying a lot of things that aren't really going to make a difference for the most part, like very large integers. It also really likes producing empty tuples, and those are good to test, but they generally represent trivial cases, and in a lot of instances, if an empty tuple is a problem, I would expect a larger example to shrink to it. I'm OK with it testing these things, but watching the verbose output, it feels like it cares about them a little bit too much. Or should I not focus too much on what hypothesis is doing internally?

I know I can add @examples, but that seems a lot like I would just be fooling myself into thinking hypothesis is useful, because the @examples don't actually influence the example generation from the strategies (they also don't shrink, which is kind of annoying). So an @example might cover a case I know to be wrong, but hypothesis wouldn't actually ever find cases that I don't know about.

Run the doctests with pytest

Right now the doctests are run with a separate script, ./run_doctests. This is done because we have to patch the doctest runner to get the behavior we want.

It would be nice to get it so that pytest runs the doctests. Possible ways to do this:

  1. Hook into the pytest runner in run_doctests so that the collector sees them as tests and automatically runs them. I don't know how hard this is to do.
  2. Upstream the improvements to pytest itself. See pytest-dev/pytest#6978 and pytest-dev/pytest#7374. This is the preferred option, but the challenge is getting pytest to accept the changes.
  3. Make a custom test that runs run_doctests. The issue here is how to make it show the output nicely, especially when there are failures. I don't know if there is a way to do it, other than to basically do option 1.

Logo

ndindex needs a logo.

Use multipledispatch

Functions like as_subindex, and in the future things from #14, will have different logic for every combination of two argument types. It might be helpful to organize the code using multipledispatch.

The actual dispatching isn't necessary. It would just help with organization. So maybe in fact we can just roll our own simple variant, since we don't really need to handle subclasses or collisions.
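
A sketch of such a simple variant: a registry keyed on the exact pair of argument types, with no subclass resolution (all names here are illustrative):

_dispatch_table = {}

def dispatch_pair(type1, type2):
    # Register an implementation for an exact (type1, type2) pair
    def decorator(func):
        _dispatch_table[type1, type2] = func
        return func
    return decorator

def as_subindex_dispatched(idx1, idx2):
    try:
        impl = _dispatch_table[type(idx1), type(idx2)]
    except KeyError:
        raise NotImplementedError(f"{type(idx1).__name__} on {type(idx2).__name__}")
    return impl(idx1, idx2)

@dispatch_pair(slice, slice)
def _slice_slice(idx1, idx2):
    ...  # slice-on-slice logic would go here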

Consider opening a NEP to allow overrride of ndarray.__getitem__ dispatch

When I first picked up and tried to start using ndindex to do cool things, the biggest pain point/stumbling block for me was the fact that you can't do

arr[ix]

and instead have to do

arr[ix.raw]

So let's change numpy to make the nicer syntax a reality! There's already one NEP under consideration that will make significant improvements/simplifications to how complex indexing works:

https://numpy.org/neps/nep-0021-advanced-indexing.html

We could maybe include a proposal for improving the __getitem__ machinery (making it more flexible/overridable wrt the object being used as an index) as part of the ongoing implementation work for NEP 21, or possibly as a new NEP.

Custom None for Slice args?

A big issue with Slice is that any of the three arguments can be None. So any code that does a numeric check like s.stop < 0 must first check if it is None or it will get a TypeError. In some cases this can be avoided by calling reduce() first, but it can't be avoided in general, since reduce() without a shape cannot remove None from start and stop in all cases.

Instead of requiring all code that deals with Slice to be defensive, I wonder if it would be prudent to have a custom none type that is used in Slice args that has

class none:
    # Compares False under all orderings, so numeric checks like
    # s.stop < 0 are simply False when the argument is "None"
    def __lt__(self, other):
        return False
    def __gt__(self, other):
        return False
    def __le__(self, other):
        return False
    def __ge__(self, other):
        return False
    def __eq__(self, other):
        return other is None or isinstance(other, none)

With this, a check like if s.stop < 0 will do the right thing without needing a guard like if s.stop is not None and s.stop < 0, and it prevents the temptation to write incorrect logic like if s.stop and s.stop < 0.

The downside is that this will break code that checks is None, and there would be no way to avoid that, since it's impossible to override is (we have the same problem with ellipsis(); basically, is is terrible).

So I'm not sure if it's a good idea or not. For ndindex library code, we can get away with requiring defensive coding, because it is so thoroughly tested that any mistake would be caught. But this is an issue for end-user code as well.

Improving Hypothesis' index and shape strategies

👋 I'm a Hypothesis core dev, and got very excited when I saw your intro blog post today.

I've been working to support better property-based testing for science code for some time now - see e.g. scipy-conference/scipy_proceedings#549 - and it's lovely to find another project and potential collaboration!

https://github.com/Quansight/ndindex/blob/0d9d131a5310915f3a51ebe95491e43afcd1dcad/ndindex/tests/helpers.py#L14-L18

In particular, both of these limitations are things that we can fix!

While I can totally understand and support rolling your own rather than working upstream, it would be awesome if you could open an issue when you run into things like this - one of our limiting factors is actually user input and use-cases, so if you have suggestions they will find interested ears 😁

Before I go and upgrade our slices() and basic_indices() strategies, can I just check:

  • The "empty selection only if start == end" aspect of slices() is OK? If so I can just add some logic to maybe-invert index-from-start to index-from-end.
  • basic_indices() could similarly maybe-return t[0] if it would return a length-one tuple t, unless disabled by setting a new argument to False. Do you think this would match NumPy users' intuitions about what a basic index is?

Testing mode for hypothesis explain phase

Hypothesis has a new feature that guesses which line of code caused a failure: https://hypothesis.readthedocs.io/en/latest/settings.html#phases

We should add some option to enable it in the tests. Currently it is disabled by default. But even if you enable it with --hypothesis-explain, it doesn't work, because coverage has to be disabled. We should also disable the explicit examples, because they prevent it from working (which maybe should be considered a hypothesis bug).

Function to determine what chunk(s) an index falls in

Now that we have as_subindex() to reindex an index onto a chunk index, we also need a function to determine which chunks to reindex onto. The naive way is to look at every chunk, but if many chunks do not intersect the index, this is very inefficient.

For a single dimension with slices, in versioned-hdf5, we do

for i in range(math.floor(start/chunk), math.ceil(stop/chunk)):

(see deshaw/versioned-hdf5#72). But this logic is more complicated in higher dimensions, so I think we need something in ndindex that does it.

I will need to think about what the right abstraction here is. The naive model could be more efficient if we had a function to check if an index is empty (more efficient in the sense that we would avoid doing array computations on empty indices; it would still be inefficient from the ndindex point of view).

A less general way, but one that can perhaps be implemented more efficiently, would be to specify the chunks as a chunk shape rather than a list of chunk indices (like the h5py syntax, where (100, 200, None) would mean a chunk size of 100 in the first dimension, 200 in the second, and no chunking in the third).
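
A sketch of the higher-dimensional generalization of the versioned-hdf5 loop, assuming the index is a tuple of step-1 slices with nonnegative bounds and the chunks are given h5py-style as a size per dimension (chunk_indices is a hypothetical name):

import math
from itertools import product

def chunk_indices(slices, chunk_sizes):
    # Yield the coordinates of every chunk that the slices overlap
    ranges = [range(math.floor(s.start / c), math.ceil(s.stop / c))
              for s, c in zip(slices, chunk_sizes)]
    return product(*ranges)

print(list(chunk_indices((slice(150, 250), slice(0, 50)), (100, 200))))
# [(1, 0), (2, 0)]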

Symbolic arguments

It would be useful to allow symbolic arguments to ndindex functions. The best way to do it is to allow the inputs to be SymPy expressions. This would require rewriting all the functions so that they can return SymPy expressions. In particular, functions would either have to be nested somehow, or else return sympy.Piecewise objects. Some functions use more advanced things like the Chinese Remainder Theorem, so it isn't clear to me how to represent those completely symbolically.

Aside from allowing people to do calculations on expressions, one advantage of doing this is that it would provide a way for people to generate functions for indexed operations that do not depend on ndindex. The idea would be symbolic input -> ndindex -> sympy expression -> codegen to Python.

An open question is how to handle None as an input to Slice. Clearly an expression like x + 1 cannot be None, but should we allow a single Symbol x to be None? And if so, how do we represent that symbolically?

This is a very long term goal. I want to get things working with numeric arguments first.

ndindex 1.7 apparently introduced some glitches in wheels for Win

Something happened in ndindex 1.7 that prevents the python-blosc2 package from using it straight from PyPI on Windows. The ndindex dependency used to work well with ndindex < 1.7, but recently we started to notice that GHA was crashing on Windows.

The offending error happens when trying to import a wheel made out of ndindex 1.7:

ImportError: DLL load failed while importing ndindex: %1 is not a valid Win32 application.

Apparently the wheel builds fine, but for some reason it no longer imports on Windows.

Type confusion with Tuple

There's a lot of potential for type confusion with Tuple:

  • Tuple(tuple) is currently not valid. To create a Tuple from a tuple, you have to use Tuple(*tuple). But there are valid slices with nested tuples, so we can't make it work.
  • len(t) gives ValueError because len in ndindex means the number of elements indexed by the index for a single axis. For a Tuple, this isn't implemented, but we could make it assume there are as many axes as elements of the tuple. But people might expect it to mean len(Tuple.args).
  • t[0] doesn't work; eventually it will mean index composition (a[t][0]), but people might expect it to mean t.args[0].

I don't know if there are good solutions here, other than documentation, unless we want to give up some of the nice syntactic sugar. It's unfortunate that tuple is an index that itself can be indexed.

split function

split(i0, [i1, i2, ...]) will return a list of indices [j1, j2, ...] such that a[i0] = concat(a[i1][j1], a[i2][j2], ...).

This is needed primarily so that we can split an index into chunks, but this seems like a good generalization.
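
A concrete instance of the defining identity, with the j indices worked out by hand for slice chunks (split itself does not exist yet):

import numpy as np

a = np.arange(10)
i0 = slice(3, 8)
i1, i2 = slice(0, 5), slice(5, 10)  # two chunks covering a
j1, j2 = slice(3, 5), slice(0, 3)   # i0 reindexed onto each chunk
assert (a[i0] == np.concatenate([a[i1][j1], a[i2][j2]])).all()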

Tuple.reduce(shape) should remove redundant slices

Right now Tuple.reduce(shape) always returns a Tuple with as many dimensions as len(shape). But perhaps it would be better if it simplified and removed redundant slices, especially at the end of the tuple.

Use CSS to layout the examples in the slices docs

The examples in the slices docs are currently made using MathJax, but it would be better to use a CSS layout instead. That would give us more control over how they look and allow using the same code font. Also, the circles are currently broken on mobile.

Test that Tuple.reduce can't remove any indices

We can't guarantee that Tuple.reduce is fully canonical like with slices #82, because there are too many combinations. But we can test that it removes any redundant terms. Something like

diff --git a/ndindex/tests/test_tuple.py b/ndindex/tests/test_tuple.py
index c8a53a4..3a02961 100644
--- a/ndindex/tests/test_tuple.py
+++ b/ndindex/tests/test_tuple.py
@@ -10,7 +10,7 @@ from pytest import raises
 from ..ndindex import ndindex
 from ..tuple import Tuple
 from ..integer import Integer
-from .helpers import check_same, Tuples, prod, short_shapes, iterslice
+from .helpers import assert_equal, check_same, Tuples, prod, short_shapes, iterslice

 def test_tuple_constructor():
     # Test things in the Tuple constructor that are not tested by the other
@@ -131,6 +131,12 @@ def test_tuple_reduce_hypothesis(t, shape):
         if isinstance(reduced, Tuple):
             assert len(reduced.args) != 1
             assert reduced == () or reduced.args[-1] != ...
+            args = reduced.args
+            for i in range(len(args)):
+                a_index = a[index.raw]
+                idx = Tuple(*(args[0:i] + args[i+1:]))
+                # Is there a better way to do assert not equal?
+                raises((IndexError, AssertionError), lambda: assert_equal(a_index, a[idx.raw]))
         # TODO: Check the other properties from the Tuple.reduce docstring.

 def test_tuple_reduce_explicit():

The problem is that this leads to some odd cases which aren't straightforward to match, for example, (..., False) and False are equivalent for shape (0,) because the added 0 to the shape from the False is the same on the left or right. Similarly for (..., None) and None for shape (1,).

Make classes completely unevaluated

Right now, classes canonicalize in their constructors. But a better design would be to not do anything in the constructors other than basic type checking. There should then be some sort of canonicalize method (but not called "canonicalize"; maybe we can use doit like SymPy), which applies the normalization. This would also allow classes to use __init__ instead of __new__. (A rough sketch of this design follows the open questions below.)

Any method that assumes the inputs are canonical (e.g., no None step in Slice) would call this method first.

This way, people can represent indices exactly, without them being normalized. Things like Slice(0, n) and Slice(n) could be distinguishable from one another.

Open questions:

  • What sorts of things should be considered OK for type checking in constructors? I think wrong types (like non-integer) is OK. But what about, for instance, Tuple(..., ...) (tuples can only have one ellipsis)?

  • What's the standard way to create objects if you do want canonicalization? A good way would normally be to make lowercase functions that call doit for you, but we can't use slice and tuple, and the ellipsis class will probably be called ellipsis since Ellipsis is taken.

  • Should ndindex() call doit() automatically by default? My thinking is yes. Most people want a canonical form. This issue is just for those who don't. This also would solve the above point.
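
A rough sketch of the shape of this design, using a toy Slice rather than the real class (doit as in SymPy; all details here are illustrative assumptions):

class Slice:
    def __init__(self, start, stop=None, step=None):
        # The constructor does only basic type checking, no canonicalization
        for arg in (start, stop, step):
            if arg is not None and not isinstance(arg, int):
                raise TypeError(f"invalid slice argument: {arg!r}")
        self.args = (start, stop, step)

    def doit(self):
        # Canonicalization moves here: apply the one-argument form
        # (Slice(n) means Slice(None, n), as with slice()), make the default
        # step explicit, and make a None start explicit for positive steps,
        # so Slice(0, n) and Slice(n) stay distinct until doit() is called
        start, stop, step = self.args
        if stop is None and start is not None:
            start, stop = None, start
        if step is None:
            step = 1
        if start is None and step > 0:
            start = 0
        return Slice(start, stop, step)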
