nv-legate / cunumeric
An Aspiring Drop-In Replacement for NumPy at Scale
Home Page: https://docs.nvidia.com/cunumeric/24.06/
License: Apache License 2.0
Something like legate.numpy.arange(1, 3, 5) throws an error: AttributeError: 'Future' object has no attribute 'compute_parallel_launch_space'. (Here, start is 1, stop is 3, and step is 5. See the signature from vanilla NumPy.)
test.py:
from legate import numpy as legatenumpy
import numpy as truenumpy
t_start, t_end, dt = 0, 1, 2
t_legate = legatenumpy.arange(t_start, t_end, dt)
t_numpy = truenumpy.arange(t_start, t_end, dt)
print(truenumpy.allclose(t_legate, t_numpy))
print(t_legate)
print(t_numpy)
Run with: legate --cpus 1 ./test.py -lg:numpy:test
Expected behavior: either
True
[0]
[0]
or a NotImplementedError, or an error message indicating this specific use case of arange is not currently supported.
Actual output:
Traceback (most recent call last):
File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 394, in legion_python_main
run_path(args[start], run_name='__main__')
File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 193, in run_path
exec(code, module.__dict__, module.__dict__)
File "./test.py", line 6, in <module>
t_legate = legatenumpy.arange(t_start, t_end, dt)
File "<prefix>/lib/python3.8/site-packages/legate/numpy/module.py", line 61, in arange
result._thunk.arange(start, stop, step, stacklevel=(stacklevel + 1))
File "<prefix>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 2338, in arange
launch_space = dst.compute_parallel_launch_space()
AttributeError: 'Future' object has no attribute 'compute_parallel_launch_space'
Notes: this only seems to happen with -lg:numpy:test. arange with these arguments might be an edge case in most applications; without -lg:numpy:test, I guess this error will probably never be triggered, because this use case creates a one-element array and does not go through the Legion codepath.

The current unary reduction is missing implementations for the following cases:
np.argmin and np.argmax with no axis value

The expression np.zeros((5,))[5:] evaluates to array([], dtype=float64) in NumPy, but causes an IndexError: index 5 is out of bounds for axis 0 with size 5 in Legate.NumPy.
When using numpy.tile, if the input array is an array obtained from reshape, the program crashes at the Legion level, i.e., there is no Python error traceback.
test.py:
from legate import numpy
# this works
print(numpy.tile(numpy.array([[2], [1]], dtype=numpy.float64), (1, 10)))
# this does not work
print(numpy.tile(numpy.arange(2, 0, -1, dtype=numpy.float64).reshape((2, 1)), (1, 10)))
legate --cpus 1 ./test.py -lg:numpy:test -lg:inorder
The first print correctly prints
[[2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
However, the second print failed because the program crashed. The error:
[0 - 7f08045c77c0] 0.807842 {5}{runtime}: [error 164] LEGION ERROR: Dynamic type mismatch in 'get_index_space_domain' (from file <prefix>/legate.core/legion/runtime/legion/region_tree.inl:3213)
For more information see:
http://legion.stanford.edu/messages/error_code.html#error_code_164
Signal 6 received by node 0, process 398141 (thread 7f08045c77c0) - obtaining backtrace
Signal 6 received by process 398141 (thread 7f08045c77c0) at: stack trace: 14 frames
[0] = /usr/lib/libpthread.so.0(+0x13960) [0x7f08045a4960]
[1] = /usr/lib/libc.so.6(gsignal+0x145) [0x7f080410aef5]
[2] = /usr/lib/libc.so.6(abort+0x116) [0x7f08040f4862]
[3] = <prefix>/lib/liblegion.so(+0x7b9a36) [0x7f0805f09a36]
[4] = <prefix>/lib/liblegion.so(Legion::Internal::IndexSpaceNodeT<1, long long>::get_index_space_domain(void*, unsigned int)+0x77) [0x7f0805fbca67]
[5] = <prefix>/lib/liblegion.so(Legion::Internal::PhysicalRegionImpl::get_instance_info(legion_privilege_mode_t, unsigned int, unsigned long, void*, unsigned int, char const*, bool, bool, bool, int)+0x26e) [0x7f0805f1102e]
[6] = <prefix>/lib/liblgnumpy.so(Legion::FieldAccessor<(legion_privilege_mode_t)1, double, 2, long long, Realm::AffineAccessor<double, 2, long long>, false> legate::LegateDeserializer::unpack_accessor_RO<double, 2>(Legion::PhysicalRegion const&, Realm::Rect<2, long long> const&)+0x244) [0x7f06e9cd0094]
[7] = <prefix>/lib/liblgnumpy.so(legate::numpy::TileTask<double>::cpu_variant(Legion::Task const*, std::vector<Legion::PhysicalRegion, std::allocator<Legion::PhysicalRegion> > const&, Legion::Internal::TaskContext*, Legion::Runtime*)+0x17f) [0x7f06eb028fdf]
[8] = <prefix>/lib/liblgnumpy.so(void Legion::LegionTaskWrapper::legion_task_wrapper<&(void legate::LegateTask<legate::numpy::TileTask<double> >::legate_task_wrapper<&legate::numpy::TileTask<double>::cpu_variant>(Legion::Task const*, std::vector<Legion::PhysicalRegion, std::allocator<Legion::PhysicalRegion> > const&, Legion::Internal::TaskContext*, Legion::Runtime*))>(void const*, unsigned long, void const*, unsigned long, Realm::Processor)+0x50) [0x7f06eb0334f0]
[9] = <prefix>/lib/librealm.so(+0x2ae179) [0x7f080488f179]
[10] = <prefix>/lib/librealm.so(+0x2ae236) [0x7f080488f236]
[11] = <prefix>/lib/librealm.so(+0x2b0b28) [0x7f0804891b28]
[12] = <prefix>/lib/librealm.so(+0x29001a) [0x7f080487101a]
[13] = /usr/lib/libc.so.6(+0x52540) [0x7f0804120540]
Expected behavior: either
[[2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
[[2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
or a NotImplementedError.
Cunumeric-specific additional configurations to run our existing tests (in addition to the generic configurations listed in nv-legate/legate.core#27):
- LEGATE_TEST=1
- -cunumeric:test
We may want to have a separate flag to force eager mode.

My CPU is Intel's Comet Lake, which cannot be detected by OpenBLAS v0.3.10 automatically (see OpenMathLib/OpenBLAS#2769). The solution suggested by OpenBLAS is either providing a flag TARGET=... or using a later version. However, install.py has hard-coded the OpenBLAS version to v0.3.10 and does not provide any method to pass custom flags to the OpenBLAS build.
Currently, I'm building my own OpenBLAS. I'm just thinking maybe it's nicer to have a way to provide custom OpenBLAS flags to install.py? Or at least add some notes in the README to let users know they have to build/install their own OpenBLAS if they have newer CPUs?
Thanks!
It would be good to periodically run performance regression tests on representative hardware combinations. A project like https://github.com/spcl/npbench could be used as a starting point for a benchmark suite, as could an actual benchmark run we did in the past.
Bug report due to @piyueh
The following code, when run with -lg:numpy:test, prints [False], indicating that the slice has not been updated:
from legate import numpy
a = numpy.random.random((3, 3))
a[:, 0] = a[:, 2]
print(numpy.allclose(a[:, 0], a[:, 2]))
After some digging I found that we skip copies between sub-regions if they're backed by the same field, which is actually only safe if the slices are equivalent: https://github.com/nv-legate/legate.numpy/blob/2b460c5dfdd60b673e37e25231bf625fdf3ead0e/legate/numpy/deferred.py#L101-L105
If we simply skip this check then the copy ends up happening through a CopyTask, which works with subregions of the same base region.
However, the runtime errors out if the two slices overlap, e.g. if we do a[0,0:2] = a[0,1:3]
(vanilla NumPy accepts this, and does the expected thing). We should at least check for overlaps in python and produce a reasonable error message.
We also want to add a case for this to the test suite.
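A minimal sketch of what such a Python-level overlap check could look like for the 1-d sliced-copy case (the helper name and scope are illustrative only, not the actual legate.numpy code):
def slices_overlap(lhs, rhs, length):
    # Expand each slice against the array length and intersect the index sets.
    lhs_idx = set(range(*lhs.indices(length)))
    rhs_idx = set(range(*rhs.indices(length)))
    return bool(lhs_idx & rhs_idx)

# a[0, 0:2] = a[0, 1:3] would then be rejected with a clear error message:
assert slices_overlap(slice(0, 2), slice(1, 3), 3)
assert not slices_overlap(slice(0, 2), slice(2, 4), 4)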
Copies that involve advanced indexing are implemented with a scatter/gather copy. The current code ignores the transforms on the RegionFields, i.e. it doesn't take into account the translation from the NumPy (local) index space (every view's indices start from 0) to the Legion (global) index space (every subregion's indices start wherever that subregion is placed within the parent region).
In the general case the base, index and value arrays can all be views:
a = np.arange(10)
b = np.arange(10)
c = np.arange(9, -1, -1)
x = a[2:7][ c[5:10] ] # __get_item__
a[2:7][ c[5:10] ] = b[3:8] # __set_item__
The logic to handle the general case of __get_item__ might be:
- use the RegionField backing the newly created result array (which has no transform) as dst in the gather/scatter copy
- use the RegionField backing the src array's base as src
- for the RegionField to use as src_indirect:
  - if the index array is backed by an untransformed RegionField (c[5:10] * 1 above), use the final array as src_indirect
  - otherwise, apply the base array's transform (\x. x+2 in the example above) on each element of the index array (the transform operation will naturally get rid of the index's transform, if any), and use the output array as src_indirect
And for __set_item__:
- use the RegionField backing the value array as src in the gather/scatter copy
- materialize an index array encoding the value view's translation (0:5 -> 3:8 for the example above), and use that as src_indirect
- use the RegionField backing the dst array's base as dst
- for the RegionField to use as dst_indirect:
  - if the index array is backed by an untransformed RegionField (c[5:10] * 1 above), use the final array as dst_indirect
  - otherwise, apply the base array's transform (\x. x+2 in the example above) on each element of the index array (the transform operation will naturally get rid of the index's transform, if any), and use the output array as dst_indirect
The work on StanfordLegion/legion#705 would allow us to avoid materializing at least some of these fields.
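To make the local-to-global translation concrete, here is a small vanilla-NumPy sketch of applying the base view's transform (\x. x+2) to the index array before gathering from the base array; this is only an illustration of the arithmetic, not the Legion copy machinery:
import numpy as np

a = np.arange(10)
c = np.arange(9, -1, -1)

view = a[2:7]    # view of a starting at global offset 2
idx = c[5:10]    # index array, itself a view (values 4, 3, 2, 1, 0)

# NumPy-local semantics: gather from the view.
expected = view[idx]

# Legion-global semantics: apply the view's transform (+2) to each index,
# then gather directly from the base array.
assert np.array_equal(a[idx + 2], expected)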
Edit: add some more detail, enumerate more cases, fix typos
Using a scalar in allclose raises AttributeError: PROJ_1D_1D_.
test.py:
from legate import numpy as lnp
import numpy as realnp
# vanilla numpy works
a = realnp.full(10, 1e-1)
print(realnp.allclose(a, 1e-1))
# legate numpy not working
la = lnp.full(10, 1e-1)
print(lnp.allclose(la, 1e-1))
Run test.py with: legate --cpus 1 ./test.py -lg:numpy:test
The first part that uses vanilla NumPy prints True.
The second part that uses Legate NumPy raises:
Traceback (most recent call last):
File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 408, in legion_python_main
run_path(args[start], run_name='__main__')
File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 200, in run_path
exec(code, module.__dict__, module.__dict__)
File "./test.py", line 10, in <module>
print(lnp.allclose(la, 1e-1))
File "<prefix>/lib/python3.8/site-packages/legate/numpy/module.py", line 459, in allclose
return ndarray.perform_binary_reduction(
File "<prefix>/lib/python3.8/site-packages/legate/numpy/array.py", line 2068, in perform_binary_reduction
dst._thunk.binary_reduction(
File "<prefix>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 5167, in binary_reduction
) = self.runtime.compute_broadcast_transform(
File "<prefix>/lib/python3.8/site-packages/legate/numpy/runtime.py", line 2500, in compute_broadcast_transform
self.first_proj_id + getattr(NumPyProjCode, proj_name),
File "<prefix>/lib/python3.8/enum.py", line 384, in __getattr__
raise AttributeError(name) from None
AttributeError: PROJ_1D_1D_
Expected behavior: working like vanilla NumPy, or raising an exception with a clear message about what is not supported.
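A hedged workaround sketch in the meantime: broadcast the scalar to a matching array by hand before calling allclose, which may avoid the failing broadcast-transform path (this is an untested assumption, not a confirmed fix):
la = lnp.full(10, 1e-1)
print(lnp.allclose(la, lnp.full(la.shape, 1e-1)))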
I encountered a memory issue that is different from issue #33: runtime memory usage grows beyond the value of --fbmem, yet the memory usage shown in the profiling result remains constant. I'm not sure whether I misconfigured something or not.
An example is the cg.py from legate.numpy/examples. Run cg.py on an A100 80GB variant with:
NUMPY_FIELD_REUSE_FREQ=1 \
legate --gpus 1 --fbmem 80000 --eager-alloc-percentage 1 \
./cg.py --num 235 --benchmark 10
The program crashed at the 3rd benchmark run with this error message: Internal Legate CUBLAS failure with error code 13 in file dot.cu at line 587
. It doesn't say anything about memory. But when I monitored the runtime memory usage through nvidia-smi
, the memory grew along with time, and the program crashed when the memory ran out.
The first thing I don't understand is that the memory grew on top of the memory allocated through --fbmem
. The second thing is that the profiling result does not show any memory growth. It remains constant in the profiling result.
The CUDA (including cublas) version is 11.3.0.
(Lowering --fbmem to leave room for the memory growth also allows more benchmark runs to finish. For example, --fbmem 75000 allows all 10 benchmark runs to finish.)
This program (derived from tests/tensordot.py):
import legate.numpy as lg
import numpy as np
a = lg.random.rand(3, 5, 4).astype(np.float16)
b = lg.random.rand(4, 5, 3).astype(np.float16)
a = lg.random.rand(3, 5, 4).astype(np.float16)
b = lg.random.rand(5, 4, 3).astype(np.float16)
cn = np.tensordot(a, b)
print('cn', flush=True)
print(cn, flush=True)
c = lg.tensordot(a, b)
print('c', flush=True)
print(c, flush=True)
assert np.allclose(cn, c)
when run as follows:
LEGATE_TEST=1 legate 79.py -lg:numpy:test --cpus 4
fails about 20% of the time, with:
cn
[[4.07 4.83 5.01 ]
[4.2 4.562 5.863]
[4.344 4.52 3.914]]
c
[[4.07 4.83 5.01 ]
[4.2 4.562 5.863]
[4.344 4.52 3.916]]
[0 - 700005133000] 0.946367 {6}{python}: python exception occurred within task:
Traceback (most recent call last):
File "/Users/mpapadakis/legate.core/install/lib/python3.8/site-packages/legion_top.py", line 410, in legion_python_main
run_path(args[start], run_name='__main__')
File "/Users/mpapadakis/legate.core/install/lib/python3.8/site-packages/legion_top.py", line 234, in run_path
exec(code, module.__dict__, module.__dict__)
File "79.py", line 16, in <module>
assert np.allclose(cn, c)
AssertionError
Currently legate.numpy partitions the region backing an array view (slice) without considering other views or the top-level array. This can lead to tile misalignment, e.g. on the stencil benchmark, where the top-level "grid" array has one extra cell on each side compared to the "center" view. If asked to split a 37839 x 37839 grid into 4 tiles across the X dimension, legate.numpy will create the following partitions:
tile 0 tile 1 tile 2 tile 3
center: 1-9460 9461-18920 18921-28380 28381-37837
grid: 0-9459 9460-18919 18920-28379 28380-37838
Notice that the boundaries are not aligned. This behavior has the potential to cause extra traffic, as we switch between the two partitions while working on the different arrays.
We could try to capture this case by partitioning regions for views following the top-level region partition. We would likely want to guard this optimization with a heuristic, to only apply it where it would actually be beneficial (e.g. the proposed "derived" partitioning is not horribly imbalanced, in which case we would prefer the original "equal" partitioning strategy for the view).
Note that this proposal doesn't cover the case where the partitioning of view A should be aligned with view B, but neither is related to the top-level array's partition. A lazy evaluation engine could make an informed decision even in this scenario, by waiting to tile until it has seen a number of tiled partitions that need to be made, at which point it can tile them together using something like the unification algorithm used by Regent's auto-parallelizer.
In the Jacobi solver example, the code uses NameError to catch an import failure on legate.numpy. If the intention is to handle the situation when legate.numpy is not available/installed, would it be better to catch ImportError instead? Thanks.
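A minimal sketch of the suggested pattern, falling back to vanilla NumPy when legate.numpy is not installed:
try:
    import legate.numpy as np
except ImportError:
    import numpy as np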
The current RegionField
overlap test (meant to check whether an intra-array copy is safe to be implemented with a single Legion copy/task) is too conservative to be useful.
It ignores slice steps, and is very inaccurate when going from a 2d view to a 1d base array, e.g. for:
a = np.arange(25)
b = a.reshape((5, 5))
it will decide that b[3:5, 0:2] and b[3:5, 2:4] overlap, because it translates the rectangles to the base 1d space and considers the bounding boxes on that space (a[15:22] and a[17:24] in this example).
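A small vanilla-NumPy sketch of the false positive described above: the exact index sets of the two views are disjoint, but their bounding boxes in the flattened 1d space overlap:
import numpy as np

# Flat (1d base) indices actually touched by each 2d view of the 5x5 array.
lhs = (np.arange(3, 5)[:, None] * 5 + np.arange(0, 2)).ravel()  # [15 16 20 21]
rhs = (np.arange(3, 5)[:, None] * 5 + np.arange(2, 4)).ravel()  # [17 18 22 23]

exact_overlap = bool(np.intersect1d(lhs, rhs).size)               # False
bbox_overlap = lhs.min() <= rhs.max() and rhs.min() <= lhs.max()  # True
print(exact_overlap, bbox_overlap)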
Something like a[::3] does not work and gives TypeError: '<' not supported between instances of 'NoneType' and 'int', which is unclear about what happened.
test.py:
from legate import numpy
a = numpy.random.random(100)
print(a[::3])
legate --cpus 1 ./test.py -lg:numpy:test
Traceback (most recent call last):
File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 408, in legion_python_main
run_path(args[start], run_name='__main__')
File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 200, in run_path
exec(code, module.__dict__, module.__dict__)
File "./test.py", line 3, in <module>
print(a[::3])
File "<prefix>/lib/python3.8/site-packages/legate/numpy/array.py", line 382, in __getitem__
shape=None, thunk=self._thunk.get_item(key, stacklevel=2)
File "<prefix>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 428, in get_item
view, dim_map = self._get_view(key)
File "<prefix>/lib/python3.8/site-packages/legate/numpy/thunk.py", line 378, in _get_view
view = (self._standardize_slice_key(key, 0),)
File "<prefix>/lib/python3.8/site-packages/legate/numpy/thunk.py", line 325, in _standardize_slice_key
or (key.stop < 0 and -key.step > diff)
TypeError: '<' not supported between instances of 'NoneType' and 'int'
Not urgent because it does not fail silently and still gives some error after all.
When using a boolean array to access elements of another array (with a non-trivial size), it raises an error:
TypeError: nonzero() missing 1 required positional argument: 'stacklevel'
Create test.py with the following code:
from legate import numpy
qw = numpy.random.random((100, 100))
qw[qw < 0.3] = 1.0
Run test.py with legate, e.g.,
$ legate --cpus 1 ./test.py
Traceback (most recent call last):
File "<blahblah>/python3.8/site-packages/legion_top.py", line 394, in legion_python_main
run_path(args[start], run_name='__main__')
File "<blahblah>/lib/python3.8/site-packages/legion_top.py", line 193, in run_path
exec(code, module.__dict__, module.__dict__)
File "./test.py", line 3, in <module>
qw[qw < 0.3] = 1.0
File "<blahblah>/lib/python3.8/site-packages/legate/numpy/array.py", line 753, in __setitem__
self._thunk.set_item(key, value_array._thunk, stacklevel=2)
File "<blahblah>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 487, in set_item
index_array = self._create_indexing_array(
File "<blahblah>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 324, in _create_indexing_array
tuple_of_arrays = key.nonzero()
TypeError: nonzero() missing 1 required positional argument: 'stacklevel'
Expected behavior: either it works, or it raises a NotImplementedError so that users know the feature has not yet been implemented.
Notes: with a smaller array for qw in this example, everything works fine. Running on a GPU, i.e., legate --gpus 1 ./test.py, also works fine.

Is there any plan for the NDArray?
I installed legate.numpy on top of legate.core on my Ubuntu 18.04 machine. After I manually exported LEGATE_MAX_DIMS and LEGATE_MAX_FIELDS as environment variables, I then got the following error when importing legate.numpy. Any idea how to fix this?
Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import legate.numpy as np
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/legate/numpy/__init__.py", line 21, in <module>
from legate.numpy import linalg, random
File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/legate/numpy/linalg/__init__.py", line 19, in <module>
from legate.numpy.linalg.linalg import *
File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/legate/numpy/linalg/linalg.py", line 18, in <module>
from legate.numpy.array import ndarray, runtime
File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/legate/numpy/array.py", line 24, in <module>
from legate.core import LegateArray
File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/legate/core/__init__.py", line 19, in <module>
from legate.core.legate import (
File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/legate/core/legate.py", line 32, in <module>
from legate.core.legion import (
File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/legate/core/legion.py", line 858, in <module>
class IndexPartition(object):
File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/legate/core/legion.py", line 868, in IndexPartition
part_id=legion.legion_auto_generate_id(),
File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/cffi/api.py", line 912, in __getattr__
make_accessor(name)
File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/cffi/api.py", line 908, in make_accessor
accessors[name](name)
File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/cffi/api.py", line 838, in accessor_function
value = backendlib.load_function(BType, name)
AttributeError: function/symbol 'legion_auto_generate_id' not found in library '<None>': python: undefined symbol: legion_auto_generate_id
I'm executing gemm.py on 8 nodes with this command line and see the following error
/g/g15/yadav2/legate.core/install/bin/legate /g/g15/yadav2/legate.numpy/examples/gemm.py -n 46340 -p 64 -i 10 --num_nodes 32 --omps 2 --ompthreads 18 --nodes 32 --numamem 30000 --eager-alloc-percentage 1 --cpus 1 --sysmem 10000 --launcher jsrun --cores-per-node 40 --verbose
Running: jsrun -n 32 -r 1 -a 1 -c 40 -g 0 -b none /g/g15/yadav2/legate.core/install/bin/legion_python /g/g15/yadav2/legate.numpy/examples/gemm.py -n 46340 -p 64 -i 10 --num_nodes 32 -ll:py 1 -lg:local 0 -ll:ocpu 2 -ll:othr 18 -ll:onuma 1 -ll:util 2 -ll:bgwork 2 -ll:csize 10000 -ll:nsize 30000 -ll:ncsize 0 -level openmp=5 -lg:eager_alloc_percentage 1
[7 - 20003ac2f8b0] 5.675422 {5}{runtime}: [error 605] LEGION ERROR: Illegal output shard 32 from sharding functor 1073741900. Shards for this index space launch must be between 0 and 32 (exclusive). (from file /g/g15/yadav2/legate.core/legion/runtime/legion/runtime.cc:15560)
For more information see:
http://legion.stanford.edu/messages/error_code.html#error_code_605
You have removed "outdated" dockerfiles. Are there plans to restore them?
Is there a way to programmatically force Legate to wait for the completion of all pending operations? In the examples, the way to go is basically to read the output, i.e., assert not math.isnan(np.sum(output)). Is there a different way that doesn't incur the penalty of accessing all output elements?
I believe I have built legate.core and legate.numpy correctly. I can run "legate" and get a prompt.
I used the last example from https://github.com/barbagroup/CFDPython/blob/master/lessons/15_Step_12.ipynb as mentioned on the GitHub page. I used the default grid of 41 x 41 with nit=250 to get a longer runtime. However, I can't seem to get the code to run on the GPU. I put the code into a file and run it as "legate cfd.py".
I'm running this on a laptop with a 4GB GeForce 1650 GPU (I built using the Volta architecture). When I use the option "--gpus 1" it tells me I don't have enough memory. Is the number after "--gpus" referring to the "number" of GPUs or does it refer to the device numbers? When I tried "--gpus 0", thinking it was referring to the device, it runs but it runs at the same speed as the CPU. Plus "nvidia-smi" doesn't show the code ever running on the GPU.
BTW - I built legate-core using the following command:
./install.py --cuda --with-cuda /opt/nvidia/hpc_sdk/Linux_x86_64/21.3/ --arch volta --install-dir /usr/local/legate
I built legate.numpy using the following command:
python setup.py --with-core /usr/local/legate
BTW - is there a way to check that legate was built using GPUs beyond just running a code and trying to force it to run on a GPU?
Thanks!
Jeff
We should at least pass the same tests that NumPy uses, potentially replicated at multiple scales. Some bugs only become visible when the array size is substantial.
I believe examples like the following would raise an interfering-requirement error, in both master and branch-21.10:
legate.numpy.add(x[1:5], 1, out=x[2:6])
We need to fix the operators whose outputs can be redirected such that the intermediate results are materialized before getting assigned to the designated arrays. The in-place update code already handles this using the alias check on Legate Stores, so we can reuse that code.
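A hedged sketch of what the fix amounts to, written as the equivalent user-level code: materialize the intermediate result before assigning it to the aliased output view (the automatic version would do the equivalent internally):
tmp = legate.numpy.add(x[1:5], 1)  # materialize into a fresh array first
x[2:6] = tmp                       # then assign to the overlapping view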
numpy.linspace does not work correctly and returns an all-zero array without throwing any runtime errors or warnings.
Two ways to reproduce:
Option 1: legate.numpy's own tests. In legate.numpy's source code folder, run ./test.py --use cuda --gpus 1.
Option 2: a small test.py:
from legate import numpy
a = numpy.linspace(0.0, 4.0, 501)
print(a.mean(), a.sum())
print(a)
legate --gpus 1 ./test.py -lg:numpy:test
When using legate.numpy's own test suite (option 1), the result shows the test for linspace failed.
When using the test script in option 2, the result shows an all-zero array.
I noticed linspace is not listed in the Legate NumPy API reference, so I think linspace belongs to the group that is not implemented yet? In that case, the function should raise NotImplementedError instead of just returning an all-zero array.
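For reference, a hedged sketch of the values linspace(0.0, 4.0, 501) should produce, built from arange (assuming arange works in the configuration in use):
n = 501
a_ref = numpy.arange(n) * (4.0 / (n - 1))
print(a_ref.mean(), a_ref.sum())  # expected: 2.0 1002.0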
During loops, memory usage keeps growing when it should stay constant. It looks like a garbage-collection issue.
Reproduction 1: examples/stencil.py (takes more time to see the crash). Using legate.numpy/examples/stencil.py, run:
legate --cpus 1 --sysmem 1500 --eager-alloc-percentage 1 ./stencil.py --num 3000 --benchmark 20
(I lowered the system memory to make the out-of-memory happen faster.) The crash happens at about the 14th benchmark iteration, so to get a profiling result, change --benchmark to 13. That is, legate --profile --cpus 1 --sysmem 1500 --eager-alloc-percentage 1 ./stencil.py --num 3000 --benchmark 13 -lg:numpy:test -lg:inorder.
Here is the profiling result: legate_prof.tar.gz
Reproduction 2: test.py:
from legate import numpy
a0 = numpy.random.random((1004, 1004))
b0 = numpy.random.random((1004, 1004))
c0 = numpy.random.random((1004, 1004))
counter = 0
while True:
a = a0.copy()
b = b0.copy()
c = c0.copy()
for i in range(2):
a[2:-2, i] = a[2:-2, 2].copy()
b[2:-2, i] = b[2:-2, 2].copy()
c[2:-2, i] = c[2:-2, 2].copy()
for i in range(-3):
a[2:-2, i] = a[2:-2, -3].copy()
b[2:-2, i] = b[2:-2, -3].copy()
c[2:-2, i] = c[2:-2, -3].copy()
for i in range(2):
a[i, 2:-2] = a[2, 2:-2].copy()
b[i, 2:-2] = b[2, 2:-2].copy()
c[i, 2:-2] = c[2, 2:-2].copy()
for i in range(-3):
a[i, 2:-2] = a[-3, 2:-2,].copy()
b[i, 2:-2] = b[-3, 2:-2,].copy()
c[i, 2:-2] = c[-3, 2:-2,].copy()
counter += 1
print(counter)
Run with: legate --cpus 1 --sysmem 750 --eager-alloc-percentage 1 ./test.py
The out-of-memory happened at the 935th iteration. So to get a profiling output, add if counter % 934 == 0: break after print(counter), and then do legate --profile --cpus 1 --sysmem 750 --eager-alloc-percentage 1 ./test.py -lg:numpy:test -lg:inorder.
Here's the profiling result: legate_prof.tar.gz
Hi, in the NumPyProjectionFunctorRadix2D (similar in NumPyProjectionFunctorRadix3D)
https://github.com/nv-legate/legate.numpy/blob/896f4fd9b32db445da6cdabf7b78d523fca96936/src/proj.cc#L528
there are three parameters: template <int DIM, int RADIX, int OFFSET>.
The DIM is the dimension of a tensor. What about RADIX and OFFSET? I notice that in the register_projection_functors function, RADIX is given as 4 and OFFSET ranges from 0 to 3.
register_functor<NumPyProjectionFunctorRadix3D<0, 4, 0>>(
runtime, offset, NUMPY_PROJ_RADIX_3D_X_4_0);
How about 4D or NDArray? Should I format the 4D tensor projection function as follows:
register_functor<NumPyProjectionFunctorRadix4D<0, 4, 0>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_X_4_0);
register_functor<NumPyProjectionFunctorRadix4D<0, 4, 1>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_X_4_0);
register_functor<NumPyProjectionFunctorRadix4D<0, 4, 2>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_X_4_0);
register_functor<NumPyProjectionFunctorRadix4D<0, 4, 3>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_X_4_0);
register_functor<NumPyProjectionFunctorRadix4D<1, 4, 0>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_Y_4_0);
register_functor<NumPyProjectionFunctorRadix4D<1, 4, 1>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_Y_4_0);
register_functor<NumPyProjectionFunctorRadix4D<1, 4, 2>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_Y_4_0);
register_functor<NumPyProjectionFunctorRadix4D<1, 4, 3>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_Y_4_0);
register_functor<NumPyProjectionFunctorRadix4D<2, 4, 0>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_Z_4_0);
register_functor<NumPyProjectionFunctorRadix4D<2, 4, 1>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_Z_4_0);
register_functor<NumPyProjectionFunctorRadix4D<2, 4, 2>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_Z_4_0);
register_functor<NumPyProjectionFunctorRadix4D<2, 4, 3>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_Z_4_0);
register_functor<NumPyProjectionFunctorRadix4D<3, 4, 0>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_D_4_0);
register_functor<NumPyProjectionFunctorRadix4D<3, 4, 1>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_D_4_0);
register_functor<NumPyProjectionFunctorRadix4D<3, 4, 2>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_D_4_0);
register_functor<NumPyProjectionFunctorRadix4D<3, 4, 3>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_D_4_0);
?
When using sys.exit(0) to indicate a program's termination, Legate seems to catch the SystemExit raised by sys.exit(0) and treat it as an error.
Create test1.py and test2.py with these contents:
test1.py
import sys
from legate import numpy
sys.exit(0)
test2.py
import sys
from legate import numpy
Run both scripts with legate, for example:
$ legate --cpus 1 test1.py
$ legate --cpus 1 test2.py
Both scripts are supposed to output nothing. However, test1.py returns this message:
[0 - 7f34317b87c0] 0.807057 {6}{python}: python exception occurred within task:
I guess Legate catches the SystemExit from sys.exit and treats it as a normal exception, i.e., an error.
This is clearly an issue in OpenBLAS but it blocks my Legate Numpy install and is unexpected, based on my experience with OpenBLAS in other contexts.
jhammond@nuclear:~/LEGATE/np$ python3 ./install.py --install-dir $HOME/LEGATE --with-core $HOME/LEGATE 2>&1 | tee log
Verbose build is off
Legate is installing OpenBLAS into a local directory...
Cloning into '/tmp/tmpm780ryjm'...
Note: switching to 'd2b11c47774b9216660e76e2fc67e87079f26fa1'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:
git switch -c <new-branch-name>
Or undo this operation with:
git switch -
Turn off this advice by setting config variable advice.detachedHead to false
Switched to a new branch 'master'
getarch_2nd.c: In function ‘main’:
getarch_2nd.c:14:35: error: ‘SGEMM_DEFAULT_UNROLL_M’ undeclared (first use in this function); did you mean ‘SBGEMM_DEFAULT_UNROLL_M’?
14 | printf("SGEMM_UNROLL_M=%d\n", SGEMM_DEFAULT_UNROLL_M);
| ^~~~~~~~~~~~~~~~~~~~~~
| SBGEMM_DEFAULT_UNROLL_M
getarch_2nd.c:14:35: note: each undeclared identifier is reported only once for each function it appears in
getarch_2nd.c:15:35: error: ‘SGEMM_DEFAULT_UNROLL_N’ undeclared (first use in this function); did you mean ‘SBGEMM_DEFAULT_UNROLL_N’?
15 | printf("SGEMM_UNROLL_N=%d\n", SGEMM_DEFAULT_UNROLL_N);
| ^~~~~~~~~~~~~~~~~~~~~~
| SBGEMM_DEFAULT_UNROLL_N
getarch_2nd.c:16:35: error: ‘DGEMM_DEFAULT_UNROLL_M’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_M’?
16 | printf("DGEMM_UNROLL_M=%d\n", DGEMM_DEFAULT_UNROLL_M);
| ^~~~~~~~~~~~~~~~~~~~~~
| XGEMM_DEFAULT_UNROLL_M
getarch_2nd.c:17:35: error: ‘DGEMM_DEFAULT_UNROLL_N’ undeclared (first use in this function); did you mean ‘QGEMM_DEFAULT_UNROLL_N’?
17 | printf("DGEMM_UNROLL_N=%d\n", DGEMM_DEFAULT_UNROLL_N);
| ^~~~~~~~~~~~~~~~~~~~~~
| QGEMM_DEFAULT_UNROLL_N
getarch_2nd.c:21:35: error: ‘CGEMM_DEFAULT_UNROLL_M’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_M’?
21 | printf("CGEMM_UNROLL_M=%d\n", CGEMM_DEFAULT_UNROLL_M);
| ^~~~~~~~~~~~~~~~~~~~~~
| XGEMM_DEFAULT_UNROLL_M
getarch_2nd.c:22:35: error: ‘CGEMM_DEFAULT_UNROLL_N’ undeclared (first use in this function); did you mean ‘QGEMM_DEFAULT_UNROLL_N’?
22 | printf("CGEMM_UNROLL_N=%d\n", CGEMM_DEFAULT_UNROLL_N);
| ^~~~~~~~~~~~~~~~~~~~~~
| QGEMM_DEFAULT_UNROLL_N
getarch_2nd.c:23:35: error: ‘ZGEMM_DEFAULT_UNROLL_M’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_M’?
23 | printf("ZGEMM_UNROLL_M=%d\n", ZGEMM_DEFAULT_UNROLL_M);
| ^~~~~~~~~~~~~~~~~~~~~~
| XGEMM_DEFAULT_UNROLL_M
getarch_2nd.c:24:35: error: ‘ZGEMM_DEFAULT_UNROLL_N’ undeclared (first use in this function); did you mean ‘QGEMM_DEFAULT_UNROLL_N’?
24 | printf("ZGEMM_UNROLL_N=%d\n", ZGEMM_DEFAULT_UNROLL_N);
| ^~~~~~~~~~~~~~~~~~~~~~
| QGEMM_DEFAULT_UNROLL_N
getarch_2nd.c:71:50: error: ‘SGEMM_DEFAULT_Q’ undeclared (first use in this function); did you mean ‘SBGEMM_DEFAULT_Q’?
71 | printf("#define SLOCAL_BUFFER_SIZE\t%ld\n", (SGEMM_DEFAULT_Q * SGEMM_DEFAULT_UNROLL_N * 4 * 1 * sizeof(float)));
| ^~~~~~~~~~~~~~~
| SBGEMM_DEFAULT_Q
getarch_2nd.c:72:50: error: ‘DGEMM_DEFAULT_Q’ undeclared (first use in this function); did you mean ‘SBGEMM_DEFAULT_Q’?
72 | printf("#define DLOCAL_BUFFER_SIZE\t%ld\n", (DGEMM_DEFAULT_Q * DGEMM_DEFAULT_UNROLL_N * 2 * 1 * sizeof(double)));
| ^~~~~~~~~~~~~~~
| SBGEMM_DEFAULT_Q
getarch_2nd.c:73:50: error: ‘CGEMM_DEFAULT_Q’ undeclared (first use in this function); did you mean ‘SBGEMM_DEFAULT_Q’?
73 | printf("#define CLOCAL_BUFFER_SIZE\t%ld\n", (CGEMM_DEFAULT_Q * CGEMM_DEFAULT_UNROLL_N * 4 * 2 * sizeof(float)));
| ^~~~~~~~~~~~~~~
| SBGEMM_DEFAULT_Q
getarch_2nd.c:74:50: error: ‘ZGEMM_DEFAULT_Q’ undeclared (first use in this function); did you mean ‘SBGEMM_DEFAULT_Q’?
74 | printf("#define ZLOCAL_BUFFER_SIZE\t%ld\n", (ZGEMM_DEFAULT_Q * ZGEMM_DEFAULT_UNROLL_N * 2 * 2 * sizeof(double)));
| ^~~~~~~~~~~~~~~
| SBGEMM_DEFAULT_Q
make: *** [Makefile.prebuild:74: getarch_2nd] Error 1
Makefile:154: *** OpenBLAS: Detecting CPU failed. Please set TARGET explicitly, e.g. make TARGET=your_cpu_target. Please read README for the detail.. Stop.
Traceback (most recent call last):
File "./install.py", line 543, in <module>
driver()
File "./install.py", line 539, in driver
install_legate_numpy(unknown=unknown, **vars(args))
File "./install.py", line 359, in install_legate_numpy
install_openblas(openblas_dir, thread_count, verbose)
File "./install.py", line 143, in install_openblas
execute_command(
File "./install.py", line 62, in execute_command
subprocess.check_call(args, cwd=cwd, shell=shell)
File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['make', '-j', '8', 'USE_THREAD=1', 'NO_STATIC=1', 'USE_OPENMP=1', 'NUM_PARALLEL=32', 'LIBNAMESUFFIX=legate']' returned non-zero exit status 2.
Currently legate.numpy will either tile an array across all the nodes in the machine, or place the entire array on a single node. Both strategies may be suboptimal in certain cases, e.g. the jacobi benchmark, where the following matrix-vector multiplication happens repeatedly every timestep (R is an NxN matrix, all other variables are Nx1 vectors):
x = (b - np.dot(R, x)) / d
The optimal partitioning for the vectors is to split them across sqrt(N) nodes, one piece for each column tile of the matrix R.
Besides providing the mechanism for this, someone needs to advise legate.numpy what the optimal partitioning is for each operation's output array. Doing this requires looking ahead at how this array is later used, and thus requires lazy evaluation.
When doing a division that requires a 1D denominator to be broadcast to a 2D array, and when the shape is larger than a certain size, an exception is raised:
Traceback (most recent call last):
File "<removed>/lib/python3.8/site-packages/legion_top.py", line 394, in legion_python_main
run_path(args[start], run_name='__main__')
File "<removed>/lib/python3.8/site-packages/legion_top.py", line 193, in run_path
exec(code, module.__dict__, module.__dict__)
File "./test1.py", line 14, in <module>
c = (a[:, 1:] - a[:, :-1]) / (b[1:] - b[:-1])
File "<removed>/lib/python3.8/site-packages/legate/numpy/array.py", line 776, in __truediv__
return self.internal_truediv(
File "<removed>/lib/python3.8/site-packages/legate/numpy/array.py", line 519, in internal_truediv
return self.perform_binary_op(
File "<removed>/lib/python3.8/site-packages/legate/numpy/array.py", line 2054, in perform_binary_op
out._thunk.binary_op(
File "<removed>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 4876, in binary_op
) = self.runtime.compute_broadcast_transform(
File "<removed>/lib/python3.8/site-packages/legate/numpy/runtime.py", line 2605, in compute_broadcast_transform
raise NotImplementedError(
NotImplementedError: Legate needs support for more than 3 dimensions
Create test.py. Its content is:
from legate import numpy
a = numpy.random.random((400, 2001))
b = numpy.random.random(2001)
c = (a[:, 1:] - a[:, :-1]) / (b[1:] - b[:-1])
Run test.py with legate. I'm using this command for my test:
$ legate --cpus 1 ./test.py
Using different shapes/sizes to generate a and b seems to also affect the errors. Smaller shapes/sizes do not give any error. For example, (4, 21) and (21,) for a and b respectively do not raise any errors.
Also, using the same shapes but different runtime flags may or may not return errors. For example, using (40, 201) for a and (201,) for b:
- legate --cpus 0 --omps 1 --ompthreads 1 ./test.py works fine. No error.
- legate --cpus 0 --omps 1 --ompthreads -ll:okindhack 1 ./test.py returns the error NotImplementedError: Legate needs support for more than 3 dimensions.
If I explicitly do the broadcasting before the division, everything is fine.
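A hedged sketch of the explicit-broadcast workaround mentioned above (it assumes numpy.tile behaves as in vanilla NumPy for this case):
db = b[1:] - b[:-1]                     # shape (2000,)
db2d = numpy.tile(db, (a.shape[0], 1))  # shape (400, 2000)
c = (a[:, 1:] - a[:, :-1]) / db2d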
Currently legate.numpy will remove any argument it recognizes from the command line when the Runtime singleton is constructed (i.e. when the legate.numpy module is loaded):
This matches the way legate.core and Legion/Realm work, with each layer removing the arguments it recognizes. Legate.core does this through the launcher script, for which we automatically get documentation from argparse. Legate.numpy's arguments, however, are only known to the Runtime constructor, and are not documented anywhere else. -lg:numpy:test and -lg:numpy:shadow are developer options, so I don't think they need to be publicly documented, but AFAIK -lg:numpy:summarize is a user-targeted option, so we likely want to document it somehow (ideally in the launcher script, if we detect that the legate.numpy module has been loaded).
The following program:
from legate import numpy
a = numpy.arange(50)
indices = numpy.arange(10)
print(a[indices])
causes the runtime to emit this warning:
[0 - 7fdf65929700] 2.045884 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 1 of operation Copy (UID 5) in parent task legion_python_main (UID 1) is using uninitialized data for field(s) 1048578 of logical region (2,1,2) (from file /gpfs/fs1/mpapadakis/legate.core/legion/runtime/legion/legion_ops.cc:1192)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
Hi. I'm trying legate.numpy on Piz Daint. The following should give a symmetric matrix, however with legate.numpy, it gives a matrix of zeros:
import legate.numpy as np
x = np.random.random((60, 30))
def euclidean_broadcast(x, y):
"""r_ij = (x_ij - y_ij)^2"""
diff = x[:, np.newaxis, :] - y[np.newaxis, :, :]
return (diff * diff).sum(axis=2)
edm = euclidean_broadcast(x, x)
print(edm[:3, :3])
I'm running with
legate --launcher srun --nodes 1 --gpus 1 --fbmem 14000 edm.py -lg:numpy:test --eager-alloc-percentage 5
edm.py is the script with the code above.
Everything looks fine up to the (diff * diff). It's the sum(axis=2) that gives the zeros. It happens when using --gpus; the CPU version works fine.
I'm using these commits:
legate.numpy 496c64d (2021-05-12)
legate.core 9e327b7 (2021-05-12)
Please, let me know if you need more information. Thanks in advance!
Advanced indexing of a relatively large (e.g., length 10K) 1D array raises UnboundLocalError: local variable 'shardfn' referenced before assignment, rather than NotImplementedError.
I understand that advanced indexing is mostly not yet implemented. Most related routines raise NotImplementedError to let users know about this situation. However, this particular use case raises this different error, which seems like a bug to me.
test.py:
from legate import numpy
a = numpy.arange(10000)
print(a[(1, 2, 3), ])
$ legate --cpus 1 test.py
Traceback (most recent call last):
File "<blahblah>/lib/python3.8/site-packages/legion_top.py", line 394, in legion_python_main
run_path(args[start], run_name='__main__')
File "<blahblah>/lib/python3.8/site-packages/legion_top.py", line 193, in run_path
exec(code, module.__dict__, module.__dict__)
File "./test.py", line 3, in <module>
print(a[(1, 2, 3), ])
File "<blahblah>/lib/python3.8/site-packages/legate/numpy/array.py", line 381, in __getitem__
shape=None, thunk=self._thunk.get_item(key, stacklevel=2)
File "<blahblah>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 414, in get_item
copy = Copy(mapper=self.runtime.mapper_id, tag=shardfn)
UnboundLocalError: local variable 'shardfn' referenced before assignment
Expected behavior: either [1, 2, 3] or a NotImplementedError.
Notes: with a smaller array, e.g. a = numpy.arange(100), the code works fine. Running on a GPU, legate --gpus 1 test.py, also works fine. This is interesting, as the GPU implementation seems to be more stable than the CPU implementation?

OpenBLAS released version 0.3.15. We should update to the latest version.
It seems that one of the recent commits (possibly e24dbdd) introduced the following error for some codes:
[0 - 7f743c098700] 10.165263 {5}{runtime}: [error 68] LEGION ERROR: Region requirement 1 of operation legate::numpy::NoncommutativeBinaryUniversalFunction<legate::numpy::SubtractOperation<double> >::NormalTask (UID 164) in parent task legion_python_main (UID 1) is using uninitialized data for field(s) 1048579 of logical region (16,1,1) with read-only privileges (from file /gpfs/fs1/mzalewski/repos/quickstart-collection/legate.core/legion/runtime/legion/legion_ops.cc:1170)
Consider the following program:
a = np.arange(25).reshape((5,5))
print(np.nonzero(a > 17))
When run with plain NumPy (or legate.numpy on 1 CPU), the non-zero indices are returned in this order:
x = [3 3 4 4 4 4 4]
y = [3 4 0 1 2 3 4]
If instead we run with legate.numpy on 4 CPUs (using NUMPY_TEST to force legate.numpy to do distributed execution; command line: NUMPY_TEST=1 legate nz-order.py -lg:numpy:test --cpus 4) we get:
x = [4 4 4 3 3 4 4]
y = [0 1 2 3 4 3 4]
I.e. legate.numpy returns the indices grouped by tile (see nz-order.pdf for a visualization), instead of returning them according to the global row-major order, as is guaranteed in the NumPy API. This is a side effect of how distributed nonzero is implemented. Making this work like NumPy would require a sort after every nonzero call.
We could simply decide to live with this incompatibility, since I expect most code using nonzero will not explicitly depend on the order that nonzero elements are returned in. The most likely scenario I can think of where this incompatibility would be problematic is if the user code mixes the results of different nonzero calls in the same operation:
import legate.numpy as np
# too small to be partitioned; indices will be in C order
small = np.ones((2,2))
small_is = np.nonzero(small)
# large enough to be partitioned; indices will be grouped by tile.
large = np.zeros((10000,10000))
large[2500,2500] = 2.0
large[2500,7500] = 3.0
large[2501,2500] = 4.0
large[2501,7500] = 5.0
large_is = np.nonzero(large)
small[small_is] = large[large_is]
print(small)
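A hedged sketch of the post-processing a sort-based fix would amount to, using vanilla NumPy: re-establish row-major order by sorting on the flattened global index (2d case; the helper name is illustrative only):
import numpy as np

def row_major_nonzero(x, y, shape):
    # Sort index pairs by their flattened C-order position.
    order = np.argsort(x * shape[1] + y, kind="stable")
    return x[order], y[order]

x = np.array([4, 4, 4, 3, 3, 4, 4])
y = np.array([0, 1, 2, 3, 4, 3, 4])
print(row_major_nonzero(x, y, (5, 5)))  # ([3 3 4 4 4 4 4], [3 4 0 1 2 3 4])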
Whether OpenBLAS will be built with OpenMP or not is determined by the result of the function has_openmp(). The result of has_openmp() depends on the outcome of a g++ compilation. That is to say, even if I use --no-openmp to install Legate NumPy, the underlying OpenBLAS is still built with OpenMP (if my g++ supports it). Is this the intended behavior? Thanks.
The documentation of --no-openmp says "Build Legate NumPy with OpenMP". I'm kind of confused. Does --no-openmp disable or enable OpenMP? Thanks!
Location: line 1622 in array.py
https://github.com/nv-legate/legate.numpy/blob/3452c85f93c4a886e9f4bff5f2e87b20f98b30bf/legate/numpy/array.py#L1622
I believe this is a bug. This looks like a typo of convert_to_legate_ndarray (the method at lines 109 to 118 in the same file):
https://github.com/nv-legate/legate.numpy/blob/3452c85f93c4a886e9f4bff5f2e87b20f98b30bf/legate/numpy/array.py#L109-L118
If an example is needed to see how this is triggered, here's one:
step 1: create test.py:
from legate import numpy
a = numpy.array([1, 2, 3, 0, 4, 5, 6], dtype=float)
b = numpy.array([1, 0, 3, 0, 4, 5, 0], dtype=float)
d = numpy.divide(a, b, out=numpy.zeros_like(b), where=(b != 0))
print("d:", d)
step 2: run with, e.g., $ legate --cpus 1 ./test.py
The output:
Traceback (most recent call last):
File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 394, in legion_python_main
run_path(args[start], run_name='__main__')
File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 193, in run_path
exec(code, module.__dict__, module.__dict__)
File "./test.py", line 4, in <module>
d = numpy.divide(a, b, out=numpy.zeros_like(b), where=(b != 0))
File "<prefix>/lib/python3.8/site-packages/legate/numpy/module.py", line 853, in divide
return true_divide(
File "<prefix>/lib/python3.8/site-packages/legate/numpy/module.py", line 1050, in true_divide
return ndarray.perform_binary_op(
File "<prefix>/lib/python3.8/site-packages/legate/numpy/array.py", line 2058, in perform_binary_op
cls.get_where_thunk(
File "<prefix>/lib/python3.8/site-packages/legate/numpy/array.py", line 1622, in get_where_thunk
array = cls.convert_to_legate_array(where)
AttributeError: type object 'ndarray' has no attribute 'convert_to_legate_array'
After changing it to convert_to_legate_ndarray, the division works as expected.
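For reference, vanilla NumPy produces the following for the same inputs, matching the "works as expected" result mentioned above:
import numpy as np

a = np.array([1, 2, 3, 0, 4, 5, 6], dtype=float)
b = np.array([1, 0, 3, 0, 4, 5, 0], dtype=float)
d = np.divide(a, b, out=np.zeros_like(b), where=(b != 0))
print("d:", d)  # d: [1. 0. 1. 0. 1. 1. 0.]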
I've been testing a simple Laplace Eq. solver to compare Python+Numpy to legate.numpy and legate is hugely slower than Numpy.
The code is taken from: https://barbagroup.github.io/essential_skills_RRC/laplace/1/ . The code I actually run is the following:
import numpy as np
import time
def L2_error(p, pn):
return np.sqrt(np.sum((p - pn)**2)/np.sum(pn**2))
# end if
def laplace2d(p, l2_target):
'''Iteratively solves the Laplace equation using the Jacobi method
Parameters:
----------
p: 2D array of float
Initial potential distribution
l2_target: float
target for the difference between consecutive solutions
Returns:
-------
p: 2D array of float
Potential distribution after relaxation
'''
l2norm = 1.0
icount = 0
tot_time = 0.0
pn = np.empty_like(p)
while l2norm > l2_target:
start = time.perf_counter()
icount = icount + 1
pn = p.copy()
p[1:-1,1:-1] = .25 * (pn[1:-1,2:] + pn[1:-1, :-2] \
+ pn[2:, 1:-1] + pn[:-2, 1:-1])
##Neumann B.C. along x = L
p[1:-1, -1] = p[1:-1, -2] # 1st order approx of a derivative
l2norm = L2_error(p, pn)
end = time.perf_counter()
tot_time = tot_time + (end-start)
# end while
print("l2norm = ",l2norm)
print("icount = ",icount)
print("Total Iteration Time = ",tot_time)
print(" Time per iteration = ",tot_time/icount)
return p
# end if
if __name__ == "__main__":
nx = 401
ny = 401
# Initial conditions
p = np.zeros((ny,nx)) ##create a XxY vector of 0's
# Dirichlet boundary conditions
x = np.linspace(0,1,nx)
p[-1,:] = np.sin(1.5*np.pi*x/x[-1])
del x
start = time.time()
p = laplace2d(p.copy(), 1e-8)
stop = time.time()
print("Elapsed time = ",(stop-start)," secs")
print(" ")
# end if
When I run it on my laptop with Anaconda Python3 and Numpy I get the following:
$ python3 jacobi.py
l2norm = 9.99986062249016e-09
icount = 153539
Total Iteration Time = 127.02529454990054
Time per iteration = 0.0008273161512703648
Elapsed time = 127.14257955551147 secs
When I change the import line to legate.numpy, I usually stop the code after 15 minutes of wall time. I have let it run for up to 60 minutes and it never converges.
As a check, I've run the Numpy code with legate itself and it exactly matches the Numpy results.
I have been experimenting with replacing the l2norm computations with numpy-specific functions (np.subtract, np.square, etc.), but I have achieved no increase in performance.
Does anyone have any recommendations?
Thanks!
Jeff
(edit by Manolis: added some formatting for the code sections)
When using numpy.int32 or numpy.int64 as indices to get an element from a 2D array, the code triggers something that is not implemented, i.e., a NotImplementedError. However, if the indices are converted to native int, everything works.
I'm not sure if this is just an unimplemented feature, or if something's wrong, as this should be basic indexing with no advanced indexing involved.
numpy.int64 seems to be the default type of the elements returned when looping over a NumPy integer array, so I feel this is a common use case.
Create two scripts, test_1.py and test_2.py.
test_1.py:
from legate import numpy
a = numpy.random.random((1000, 2000))
idx = numpy.random.randint(0, 99, 10, int)
idy = numpy.random.randint(0, 99, 10, int)
for i, j in zip(idx, idy):
print("index type: ({}, {}); ".format(type(i), type(j)), end="")
print("a[i, j] = {}".format(a[i, j]))
test_2.py
from legate import numpy
a = numpy.random.random((1000, 2000))
idx = numpy.random.randint(0, 99, 10, int)
idy = numpy.random.randint(0, 99, 10, int)
for i, j in zip(idx, idy):
i, j = int(i), int(j)
print("index type: ({}, {}); ".format(type(i), type(j)), end="")
print("a[i, j] = {}".format(a[i, j]))
$ legate --cpus 1 ./test_1.py
$ legate --cpus 1 ./test_2.py
Both test_1.py and test_2.py should output 10 lines of messages in the format index type: (class XXX, class XXX); a[i, j] = YYYYYYYYYYY. XXX is either numpy.int64 or int, depending on whether it's test_1.py or test_2.py, while YYYYYYYYYYY are random numbers.
Instead, test_1.py reports an error (but successfully prints the first part of the message on the first line, i.e., index type: ...):
Traceback (most recent call last):
File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 394, in legion_python_main
run_path(args[start], run_name='__main__')
File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 193, in run_path
exec(code, module.__dict__, module.__dict__)
File "./test_1.py", line 8, in <module>
print("a[i, j] = {}".format(a[i, j]))
File "<prefix>/lib/python3.8/site-packages/legate/numpy/array.py", line 381, in __getitem__
shape=None, thunk=self._thunk.get_item(key, stacklevel=2)
File "<prefix>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 343, in get_item
index_array = self._create_indexing_array(
File "<prefix>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 332, in _create_indexing_array
raise NotImplementedError("need support for concatenating arrays")
NotImplementedError: need support for concatenating arrays
index type: (<class 'numpy.int64'>, <class 'numpy.int64'>);
When doing a deep copy (using Python's native copy module), an error is raised due to a missing argument in the call signature of array.__deepcopy__:
https://github.com/nv-legate/legate.numpy/blob/2b460c5dfdd60b673e37e25231bf625fdf3ead0e/legate/numpy/array.py#L303-L306
test.py:
import copy
from legate import numpy
a = numpy.arange(100)
b = copy.deepcopy(a)
legate --cpus 1 ./test.py -lg:numpy:test
Traceback (most recent call last):
File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 394, in legion_python_main
run_path(args[start], run_name='__main__')
File "<prefix>lib/python3.8/site-packages/legion_top.py", line 193, in run_path
exec(code, module.__dict__, module.__dict__)
File "./test.py", line 4, in <module>
b = copy.deepcopy(a)
File "<prefix>/lib/python3.8/copy.py", line 153, in deepcopy
y = copier(memo)
TypeError: __deepcopy__() takes 1 positional argument but 2 were given
Though it's not shown in the traceback, once I add a second argument memo to the definition of array.__deepcopy__, the error is gone (i.e., changing line 303 in array.py from def __deepcopy__(self): to def __deepcopy__(self, memo):). However, I don't know if the result of the copying is correct or not.
According to the last paragraph of copy's documentation, an additional argument (in addition to self) is required in __deepcopy__:
In order for a class to define its own copy implementation, it can define special methods __copy__() and __deepcopy__(). ... The latter is called to implement the deep copy operation; it is passed one argument, the memo dictionary. ...
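A minimal generic sketch of the protocol the copy documentation describes (illustrative only, not the actual legate.numpy implementation):
import copy

class Example:
    def __init__(self, data):
        self.data = data

    def __deepcopy__(self, memo):
        # memo maps id(original) -> copy, so shared/recursive objects are
        # only copied once.
        new = Example(copy.deepcopy(self.data, memo))
        memo[id(self)] = new
        return new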
Though ndarray has its own .copy() method, the __copy__ and __deepcopy__ protocols from native Python are still useful. For example, when using a class:
class DummyMesh:
def __init__(self, bg, ed, n):
self.vertices = numpy.linspace(bg, ed, n+1)
it's easier to copy an instance using copy.deepcopy, e.g., grid_a = DummyMesh(0., 1., 10); grid_b = copy.deepcopy(grid_a). In this situation, the __deepcopy__ of the ndarray is triggered. Otherwise, users have to write more lines of code just to make a deep copy of the instance grid_a.
From the definitions of the shallow copy array.__copy__ (line 298) and the deep copy array.__deepcopy__, it seems they are both doing the same thing. Is this also the default behavior in vanilla NumPy? Just curious about this.
Summary:
There are 2 instances of what appears to be a comparison of a complex/bool/float (generic 'auto' variable) to zero in unary/scalar_unary_red_omp.cc. g++ and clang++ both fail, saying there's no match for the != operator with the provided types:
unary/scalar_unary_red_omp.cc:130:77: error: no match for ‘operator!=’ (operand types are ‘const std::complex<float>’ and ‘int’)
I've built legate.core (without cuda, see the 'aside' at the bottom of this post) from source, have a pre-installed OpenBLAS, and get this error when building legate.numpy on both OSX and Ubuntu.
Instances of the zero comparison (unary/scalar_unary_red_omp.cc); both trigger compiler errors:
1:
130 | for (size_t idx = 0; idx < volume; ++idx) locals[tid] += inptr[idx] != 0;
2:
137 for (size_t idx = 0; idx < volume; ++idx) {
138 auto point = pitches.unflatten(idx, rect.lo);
139 locals[tid] += in[point] != 0;
Command
On Ubuntu (the OSX command is similar):
python3 install.py --with-core /home/shivneural/legate/legate.core/target --with-openblas /usr/lib/x86_64-linux-gnu/openblas-pthread/
Environment:
I'm encountering this error on both OSX and Ubuntu (and have tried a few different compilers):
OSX 10.15.7 Catalina
Compilers tried:
a. clang++ version 12.0.0.
b. clang++ Apple LLVM version 7.0.2 (clang-700.1.81)
c. g++-11 (Homebrew GCC 11.1.0) 11.1.0 (fails due to different errors).
Ubuntu 20.04.2 LTS (Focal Fossa)
Compilers tried:
a. g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0.
Full error:
g++ -o unary/unary_red_omp.cc.o -c unary/unary_red_omp.cc -fopenmp -I/home/shivneural/legate/legate.core/install/
thrust -I. -I/usr/lib/x86_64-linux-gnu/openblas-pthread/include -std=c++14 -Wfatal-errors -I/home/shivneural/legat
e/legate.core/target/include -O2 -fno-strict-aliasing -DLEGATE_USE_CUDA -I/include -fPIC -DLEGATE_USE_OPENMP
unary/scalar_unary_red_omp.cc: In instantiation of ‘void legate::numpy::ScalarUnaryRedImplBody<legate::numpy::Varia
ntKind::OMP, legate::numpy::UnaryRedCode::COUNT_NONZERO, CODE, DIM>::operator()(uint64_t&, legate::AccessorRO<typen
ame legate::LegateTypeOf<CODE>::type, DIM>, Legion::Rect<N>&, const legate::numpy::Pitches<(DIM - 1)>&, bool) const
[with legate_core_type_code_t CODE = COMPLEX64_LT; int DIM = 1; uint64_t = long unsigned int; legate::AccessorRO<t
ypename legate::LegateTypeOf<CODE>::type, DIM> = Legion::FieldAccessor<LEGION_READ_PRIV, std::complex<float>, 1, lo
ng long int, Realm::AffineAccessor<std::complex<float>, 1, long long int>, false>; typename legate::LegateTypeOf<CO
DE>::type = std::complex<float>; Legion::Rect<N> = Realm::Rect<1, long long int>]’:
./unary/scalar_unary_red_template.inl:132:75: required from ‘legate::numpy::UntypedScalar legate::numpy::ScalarUn
aryRedImpl<KIND, legate::numpy::UnaryRedCode::COUNT_NONZERO>::operator()(legate::numpy::ScalarUnaryRedArgs&) const
[with legate_core_type_code_t CODE = COMPLEX64_LT; int DIM = 1; legate::numpy::VariantKind KIND = legate::numpy::Va
riantKind::OMP]’
/home/shivneural/legate/legate.core/target/include/utilities/dispatch.h:67:40: required from ‘constexpr decltype(
auto) legate::inner_type_dispatch_fn<DIM>::operator()(legate::LegateTypeCode, Functor, Fnargs&& ...) [with Functor
= legate::numpy::ScalarUnaryRedImpl<legate::numpy::VariantKind::OMP, legate::numpy::UnaryRedCode::COUNT_NONZERO>; F
nargs = {legate::numpy::ScalarUnaryRedArgs&}; int DIM = 1; legate::LegateTypeCode = legate_core_type_code_t]’
/home/shivneural/legate/legate.core/target/include/utilities/dispatch.h:141:41: required from ‘constexpr decltype
(auto) legate::double_dispatch(int, legate::LegateTypeCode, Functor, Fnargs&& ...) [with Functor = legate::numpy::S
calarUnaryRedImpl<legate::numpy::VariantKind::OMP, legate::numpy::UnaryRedCode::COUNT_NONZERO>; Fnargs = {legate::n
umpy::ScalarUnaryRedArgs&}; legate::LegateTypeCode = legate_core_type_code_t]’
./unary/scalar_unary_red_template.inl:167:27: required from ‘legate::numpy::UntypedScalar legate::numpy::scalar_u
nary_red_template(legate::TaskContext&) [with legate::numpy::VariantKind KIND = legate::numpy::VariantKind::OMP]’
unary/scalar_unary_red_omp.cc:151:61: required from here
unary/scalar_unary_red_omp.cc:130:77: error: no match for ‘operator!=’ (operand types are ‘const std::complex<float
>’ and ‘int’)
130 | for (size_t idx = 0; idx < volume; ++idx) locals[tid] += inptr[idx] != 0;
| ~~~~~~~~~~~^~~~
compilation terminated due to -Wfatal-errors.
make: *** [/home/shivneural/legate/legate.core/target/share/legate/legate.mk:200: unary/scalar_unary_red_omp.cc.o]
Error 1
Changing the 0 to std::complex<float>(0.0f, 0.0f) emits an error that the operand types are a bool and a complex float.
Perhaps I'm configuring something incorrectly, in which case any guidance is appreciated.
(Aside: my Ubuntu machine is a GCP instance with a T4 GPU, running CUDA 10.1. When kicking off a legate.core with-cuda build, it fails because it can't recognize the "__habs" half-precision function when building Legion:
legate/legate.core/legion/runtime/mathtypes/half.h(364): error: identifier "__habs" is undefined
It looks like the T4's Turing architecture isn't one of Legate's supported platforms, but AFAIK Turing supports half precision.)
Thanks
When run in debug mode, some tests fail with the following error:
[0 - 7f8847346700] 1.473962 {5}{runtime}: [error 67] LEGION ERROR: Invalid mapper output from invocation of 'map_task' on mapper NumPy Mapper on Node 0. Mapper specified instance that does not meet region requirement 2 for task legate::numpy::BinaryUniversalFunction<legate::numpy::AddOperation<double> >::NormalTask (ID 133). The index space for the instance has insufficient space for the requested logical region. (from file /legate.core/legion/runtime/legion/legion_tasks.cc:3149)
For more information see:
http://legion.stanford.edu/messages/error_code.html#error_code_67
In NumPy, basic slices can lie partially or completely outside the bounds of an array, and the out-of-bounds indices are simply ignored. However, requesting out-of-bounds indices is an error when using advanced indexing:
>>> np.arange(10)[ 12:14 ]
array([], dtype=int64)
>>> np.arange(10)[ [12,13] ]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: index 12 is out of bounds for axis 0 with size 10
Currently in Legate such cases of advanced indexing are implemented using gather/scatter copies, where out-of-bounds indices are currently ignored, so we are not emitting an error for this case.
We have requested support for a check at the Realm level (see StanfordLegion/legion#1084), but even when this is implemented we may want to avoid it, as copies will be much faster without it.
Currently every array copy statement is translated into a single Legion copy operation or a single task launch. If the LHS and RHS in the copy statement refer to the same array and the slices overlap, e.g. for a[0:2] = a[1:3], then we will get a runtime error due to aliasing of the region requirements in the emitted operation.
This operation works fine in vanilla NumPy. To make it work in legate.numpy we would need to copy the RHS into an intermediate array, and copy from that into the LHS.
For basic copy statements it is possible we can come up with a cheap check to accurately detect overlap, and thus decide if the intermediate array is necessary (see #39). For advanced copy statements, however, this check would be much more expensive (since the set of affected indices is data-dependent), thus we should always use an intermediate array.
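A minimal sketch of the intermediate-array approach described above, written as the equivalent user-level code:
tmp = a[1:3].copy()  # materialize the RHS before it gets overwritten
a[0:2] = tmp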
Each supported function should be fully exercised in the test suite, e.g. with a where array, if supported by the operation. To keep things sane, each of the above parameters can be tested in isolation.
To cover an arbitrary number of dimensions it will be necessary to programmatically generate inputs, e.g. see https://github.com/nv-legate/legate.numpy/blob/896f4fd9b32db445da6cdabf7b78d523fca96936/tests/binary_op_broadcast.py and https://github.com/nv-legate/legate.numpy/blob/067a541905bf3bfc8d3727c6e1fe97a4855729b9/tests/intra_array_copy.py.
The NumPy test suite may be a good starting point, see #22.
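A hedged sketch of programmatic input generation across dimensions, in the spirit of the linked tests (the function name and shapes are illustrative only):
import numpy as np

def generate_inputs(max_ndim=3, extent=4):
    # Yield one random array per dimensionality, up to max_ndim.
    for ndim in range(1, max_ndim + 1):
        yield np.random.random((extent,) * ndim)

for arr in generate_inputs():
    print(arr.shape)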