dmlc / dlpack
Common in-memory tensor structure
Home Page: https://dmlc.github.io/dlpack/latest
License: Apache License 2.0
The DLPack protocol doesn't seem to have a representation of complex numbers. I don't know whether this is intentional (e.g., is the intent that clients use a structure-of-arrays representation of the real and imaginary parts?) or whether it is a missing feature.
(I therefore chose to leave this case unimplemented when adding DLPack support to JAX, but it would be nice to add it once the protocol is fixed.)
The GitHub 'About' description reads
RFC for common in-memory tensor structure and operator interface for deep learning system
I believe this could be updated to remove "RFC" now that DLPack is accepted across many communities.
Updated About proposal:
Common in-memory tensor structure and operator interface for deep learning systems
A related issue in onnxruntime: microsoft/onnxruntime#4162. Not sure where this feature request should belong.
There is a DenseTensor type in Microsoft ML/onnxruntime. It would be good to have a standard example wrapper for converting to and from DLPack. Then users could use the provided structs in their own C# wrappers for their own C function wrappers that accept or return DLPack tensors.
A good example of this could be onnxruntime wrapper for C#: https://github.com/microsoft/onnxruntime/blob/3530ce541cbb66f05e523f92b62cebaa4793bd3f/csharp/src/Microsoft.ML.OnnxRuntime/NativeMethods.cs and https://github.com/microsoft/onnxruntime/tree/3530ce541cbb66f05e523f92b62cebaa4793bd3f/csharp/src/Microsoft.ML.OnnxRuntime
(I may upload a sample wrapper later on, but I don't have it yet)
/*!
* \brief strides of the tensor,
* can be NULL, indicating tensor is compact.
*/
int64_t* strides;
The comment does not say whether the strides are in bytes or in number of elements. I believe NumPy uses the former and PyTorch uses the latter; there does not seem to be an obvious convention.
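To make the ambiguity concrete, here is the same array described in both conventions (NumPy is used purely for illustration; DLPack's header later documented its `strides` field as element counts):

```python
import numpy as np

a = np.zeros((3, 4), dtype=np.float64)

# NumPy reports strides in BYTES:
print(a.strides)            # (32, 8)

# PyTorch (and DLPack's own `strides` field) count ELEMENTS;
# dividing by the item size converts between the two conventions.
elem_strides = tuple(s // a.itemsize for s in a.strides)
print(elem_strides)         # (4, 1)
```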
Could you please generate a new version/release for this project?
This RFC proposes to rename DLContext to DLDevice. DLContext indicates a device for tensors and ops. Two main reasons for this change: frameworks generally use device to represent the device to run on, and I think it'd be good for us to use similar terminology to reduce confusion. This change will require downstream projects to change accordingly. Feel free to bring up your thoughts and concerns.
Meanwhile, we have a similar RFC for the name change in the TVM Discuss forum.
Hi @tqchen, I've tried to follow the original discussion mentioned in the README, but got lost somewhere in the middle.
The idea of reusing operations across frameworks sounds good, but what is the benefit of this for
framework developers?
final users (I assume here users of DL frameworks, but there are people who use tensor libraries directly)?
And what is the relation with non-DL tensor frameworks (NumPy and Blitz++ are examples)? Will those be able to use operations from other frameworks?
It would probably be a good idea to point out such potential benefits in the README.
I made an experimental wrapper: https://github.com/vadimkantorov/pydlpack/blob/master/dlpack.py#L107
The most difficult part is managing memory / capsules. Currently it uses a sort of move semantics (and deallocation is done in C). I'm sure you'd be able to do it better.
It would be a nice illustration in addition to the existing borrowing from NumPy.
A more complete usecase of mine: https://github.com/vadimkantorov/readaudio
dlpack/include/dlpack/dlpack.h
Lines 38 to 47 in 24b4f92
According to this comment:
// kCPUPinned = kCPU | kGPU
It seems these constants are defined as bit-field flags that can be combined with each other. However, if they are bit flags, only one of kMetal, kVPI and kROCM should be setting the 0b00001000 bit. If they are simple enum values, they shouldn't be combined with bitwise operations; a descriptive comment would be better than the one shown above, and the skipped values (5, 6, 7) are misleading.
If they are defined as bit flags, we can combine them as the result of a machine-capability query. But currently the possibility that we use multiple *PUs (other than CPU+GPU) or APIs in a single NN task is quite low. Maybe in some deep feature extraction tasks it can be helpful.
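To spell out the inconsistency, here is a sketch using the constant values from the header revision the comment refers to (values copied from the early dlpack.h; treat them as illustrative):

```python
# Device-type values as they appeared in the early header:
kCPU, kGPU, kCPUPinned, kOpenCL = 1, 2, 3, 4
kMetal, kVPI, kROCM = 8, 9, 10

# The comment "kCPUPinned = kCPU | kGPU" does hold as a bitwise OR:
assert kCPUPinned == kCPU | kGPU

# But if these were genuine bit flags, kVPI and kROCM would each
# *contain* the kMetal bit (0b00001000), which makes no sense:
assert kVPI & kMetal    # 9 & 8 == 8, nonzero
assert kROCM & kMetal   # 10 & 8 == 8, nonzero
```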
Dear DLPack authors,
DLPack is widely used to enable interoperability between large C++-based frameworks that furthermore provide Python bindings. The C++ parts of such a framework are often usable through an independent C++ API (e.g. in the case of PyTorch, Tensorflow, ..), where functions can be called from multi-threaded code. In contrast, multithreading in Python is much more restricted: any use of the CPython API requires that the Global Interpreter Lock (GIL) is held.
This discrepancy has implications for projects mixing C++ and Python code. Functions in such a mixed C++/Python codebase should clarify whether the caller is required to ensure that the GIL is held.
This is currently unclear in the case of DLManagedTensor::deleter, which could be called by the PyCapsule destructor (where the GIL is held) or from arbitrary C++ code at some later point following "consumption" of the capsule (where the GIL is not necessarily held; this destructor call could even occur from a different thread!).
I don't have any strong opinions either way, but ideally the documentation of the interfaces should say so.
Thanks,
Wenzel
This issue is used to track status from the framework side on adopting the data structure. So far, during GTC, @mli has talked to @Yangqing and @soumith and they declared a joint effort on this.
Currently, we are using kFloat, kUInt for types and kGPU, kOpenCL for device types.
While these constants are fine when they sit inside a framework namespace like mxnet's, DLPack is a global C structure that sits in the global namespace, so it might make sense to change the naming convention:
This might help avoid possible namespace conflicts with existing packages and also makes it clear that the constants come from DLPack. This will need an upgrade of the dependent frameworks though, and we would need to tag another release after this change. Possibly with #18
Any thoughts?
DLDevice definition in dlpack:
typedef struct {
/*! \brief The device type used in the device. */
DLDeviceType device_type;
/*! \brief The device index */
int device_id;
} DLDevice;
t_tvm_device_ definition in codegen_cpu.cc@tvm
t_tvm_device_ = llvm::StructType::create({t_int_, t_int_});
if dlpack was built with gcc -fshort-enums, then sizeof(DLDeviceType) will be 1 instead of 4, which will cause errors when accessing device_type from LLVM. One case I found is arg_binder.cc@tvm:
Bind_(device_type, TVMArrayGet(DataType::Int(32), handle, builtin::kArrDeviceType),
arg_name + ".device_type", true);
Adding one dummy device type with value 0xffffffff to DLDeviceType should fix this issue.
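A sketch of the mismatch: the LLVM side assumes the two-int layout below, so if `-fshort-enums` shrinks `DLDeviceType` to one byte on the C side, the field offsets disagree. (Python's `ctypes` is used here only to spell out the assumed layout.)

```python
import ctypes

# LLVM codegen in TVM assumes DLDevice == { int32, int32 }, i.e. 8 bytes.
class DLDevice(ctypes.Structure):
    _fields_ = [
        ("device_type", ctypes.c_int32),  # would shrink to 1 byte under -fshort-enums
        ("device_id", ctypes.c_int32),
    ]

assert ctypes.sizeof(DLDevice) == 8
```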
Similar to kDLCPUPinned, CPU shared memory is allocated and managed quite differently from normal CPU memory. In DGL, we use shared memory to store large graphs, so it is quite common to see operations that copy data between CPU memory and shared memory. I guess we could also use the extension type to handle that, but I want to bring up this issue to see whether it is better to have a dedicated type for it in DLPack.
cc @zheng-da
Dear DLPack community:
After quite a bit of discussion and coordination, we are planning to make an ABI-breaking change to add versioning and a read-only field to DLPack. DLPack has been a minimal stable ABI for exchanging array data, and we would like it to continue to stay that way.
In the meantime, we would like to have opportunities to carefully evolve DLPack, still in a carefully considered manner. After long discussions, we have decided to make the following change: introduce DLManagedTensorVersioned, which contains a new version field. We also propose to change the data API exchange protocol, to allow new versions of DLPack to return a capsule with the name "vdltensor" (instead of the old "dltensor").
The change keeps the existing structure's ABI intact, as the new ABI comes with the new struct DLManagedTensorVersioned. The data API, however, would involve an ABI update and an update of all DLPack importers/exporters.
Such a move certainly impacts a lot of packages, and we would like to plan it carefully. As a result, we would like to have at least one month of notice to let everyone chime in, and also to see whether we have enough volunteers to help update the data API exchanges in various packages.
struct DLManagedTensor {
DLTensor dl_tensor;
void * manager_ctx;
void (*deleter)(struct DLManagedTensor * self);
};
/*!
* \brief The DLPack and DLPack ABI versions of the tensor.
*/
typedef struct {
/*! \brief DLPack version. */
uint32_t dlpack;
/*! \brief DLPack ABI version. */
uint32_t abi;
} DLPackVersion;
struct DLManagedTensorVersioned {
DLPackVersion version;
void * manager_ctx;
void (*deleter)(struct DLManagedTensorVersioned * self);
uint64_t flags;
DLTensor dl_tensor;
};
#define DLPACK_BIT_MASK_READ_ONLY 1
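A sketch of why putting `version` first matters, modeled with `ctypes` (pointer-typed fields collapsed to `void*` and `dl_tensor` omitted for brevity; this mirrors the proposal above, not a finalized ABI):

```python
import ctypes

class DLPackVersion(ctypes.Structure):
    _fields_ = [("dlpack", ctypes.c_uint32), ("abi", ctypes.c_uint32)]

class DLManagedTensorVersioned(ctypes.Structure):
    _fields_ = [
        ("version", DLPackVersion),        # first field: readable before the
                                           # consumer interprets anything else
        ("manager_ctx", ctypes.c_void_p),
        ("deleter", ctypes.c_void_p),
        ("flags", ctypes.c_uint64),
        # "dl_tensor" omitted in this sketch
    ]

# Because `version` sits at offset 0, an importer can always check
# compatibility first, regardless of how later fields evolve.
assert DLManagedTensorVersioned.version.offset == 0
assert ctypes.sizeof(DLPackVersion) == 8
```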
DLPack only specifies a C API, but in practice there's a Python embedding that multiple frameworks support (via Python capsules) that does not seem to be formally specified or standardized.
The protocol seems to be:
The producer exports a DLManagedTensor as a Python capsule with the name "dltensor".
When the consumer consumes the DLManagedTensor, it renames the capsule to "used_dltensor" so the same capsule cannot be consumed twice.

I'm raising this issue for DLPack bfloat16 support.
Bfloat16 is a popular 16-bit floating-point format for machine learning, supported by multiple hardware platforms, e.g. TPU. Compared to fp16, bfloat16 has a greater dynamic range, so it's useful for things like gradients that can fall outside fp16's dynamic range. Compared to fp32, using bfloat16 reduces the size of data in memory and allows larger models to fit in the same amount of memory. So bfloat16 has many advantages, and it's a trend for different frameworks to support it. TensorFlow already supports the bfloat16 data type, and we are now adding bfloat16 support to MXNet.
DLPack is an open in-memory tensor structure for sharing tensors among deep learning frameworks. Supporting bfloat16 would make DLPack more flexible and better integrated for data sharing between frameworks.
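The dynamic-range argument can be seen from the bit layout: to first order, bfloat16 is the top half of a float32, so it keeps float32's 8-bit exponent while giving up mantissa precision, unlike fp16, which shrinks the exponent. A rough sketch (ignoring rounding modes):

```python
import struct

def fp32_to_bf16_bits(x):
    """Truncate an IEEE-754 float32 to its top 16 bits (naive bfloat16)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

# 1.0f is 0x3F800000 as float32; its bfloat16 bit pattern is the top half.
assert fp32_to_bf16_bits(1.0) == 0x3F80
assert fp32_to_bf16_bits(2.0) == 0x4000
```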
PyTorch has two interfaces for converting data to/from the DLPack format. tsor = torch.utils.dlpack.from_dlpack(dl) converts a DLPack-defined tensor to a PyTorch-defined tensor, and dl = torch.utils.dlpack.to_dlpack(tsor) converts a PyTorch-defined tensor to a DLPack-defined tensor. When using the to_dlpack function, getDLDataType is used to check the data types that have been enabled for data sharing in DLPack:
DLDataType getDLDataType(const Tensor& t) {
DLDataType dtype;
dtype.lanes = 1;
dtype.bits = t.element_size() * 8;
switch (t.scalar_type()) {
case ScalarType::Byte:
dtype.code = DLDataTypeCode::kDLUInt;
break;
case ScalarType::Char:
dtype.code = DLDataTypeCode::kDLInt;
break;
case ScalarType::Double:
dtype.code = DLDataTypeCode::kDLFloat;
break;
case ScalarType::Float:
dtype.code = DLDataTypeCode::kDLFloat;
break;
case …
case ScalarType::BFloat16:
throw std::logic_error("BFloat16 is not supported by dlpack");
break;
For now, as DLPack does not support bfloat16 yet, getDLDataType throws an error when encountering the bfloat16 data type. Once DLPack supports bfloat16, this code can easily be changed.
Similar to PyTorch, MXNet has arr = mx.nd.from_dlpack(dl), dl = mx.nd.to_dlpack_for_read(arr) and dl = mx.nd.to_dlpack_for_write(arr) for DLPack/MXNet data sharing. DTypeTransform is likewise used to check the data types.
static DLDataType DTypeTransform(int type_flag) {
switch (type_flag) {
case mshadow::kFloat32: return DLDataType{kDLFloat, 32, 1};
case mshadow::kFloat64: return DLDataType{kDLFloat, 64, 1};
case mshadow::kFloat16: return DLDataType{kDLFloat, 16, 1};
case mshadow::kBfloat16: return DLDataType{kDLBfloat, 16, 1}; // add this line to support bfloat16
case ......
}
}
Add bfloat16 support in this function, and then we can use this data type as inputs, params or outputs for operator computation.
TensorFlow doesn't support DLPack yet, but there's a discussion on it (issue). TensorFlow already supports bfloat16.
As discussed above, bfloat16 has good support in various frameworks. On the other hand, DLPack is also becoming more and more popular. So it would be really great if DLPack could support the bfloat16 data type.
Here is a draft proposal for supporting bfloat16 in DLPack. The modification in DLPack is very simple: just add one single line to DLDataTypeCode:
typedef enum {
kDLInt = 0U,
kDLUInt = 1U,
kDLFloat = 2U,
kDLBfloat = 3U, // add this line to support bfloat16
} DLDataTypeCode;
And it's done.
Do you have any ideas? Thank you @soumith @piiswrong @Yangqing @naibaf7 @bhack @edgarriba @tqchen @prigoyal @zdevito @pengzhao-intel @ZhennanQin
I noticed whilst comparing versions that a couple of comments that were OK in version 0.0 somehow became nonsensical by v0.6.
This trivial patch fixes the grammar.
--- dlpack-0.6.orig/README.md
+++ dlpack-0.6/README.md
@@ -2,14 +2,14 @@
[![Build Status](https://github.com/dmlc/dlpack/actions/workflows/main.yaml/badge.svg?branch=main)](https://github.com/dmlc/dlpack/actions/workflows/main.yaml)
-DLPack is an open in-memory tensor structure to for sharing tensor among frameworks. DLPack enables
+DLPack is an open in-memory tensor structure for sharing tensors among frameworks. DLPack enables
- Easier sharing of operators between deep learning frameworks.
- Easier wrapping of vendor level operator implementations, allowing collaboration when introducing new devices/ops.
- Quick swapping of backend implementations, like different version of BLAS
- For final users, this could bring more operators, and possibility of mixing usage between frameworks.
-We do not intend to implement of Tensor and Ops, but instead use this as common bridge
+We do not intend to implement Tensor and Ops, but instead use this as common bridge
to reuse tensor and ops across frameworks.
## Proposal Procedure
Following #17:
DLPack so far does not contain memory management for tensors and only asks users to pass a non-managed DLTensor around, which has served our purposes so far. For each framework, like PyTorch/ATen and MXNet, there is a need for managing these tensors. This can be done in several ways:
The major question is as follows:
Just call it a "multi/n-dimensional array". Tensor is a word with a very specific mathematical meaning. Calling any n-dimensional array a tensor is incorrect in the same way that calling every two-dimensional array a matrix is wrong. NumPy set a correct precedent; there's no reason to do worse than NumPy.
From the CUDA side, nearly all of the DL frameworks and array libraries that support DLPack use CUDA streams and stream-order both their computations and memory allocations. In its current form, DLPack doesn't specify any synchronization semantics, nor does it have a way for a producer-consumer pair to exchange the information necessary to continue stream-ordering computations.
I imagine there's a similar problem in other contexts as well (OpenCL, ROCm, etc.) where maybe it's possible to generalize an approach.
This is a thread to bring awareness to and discuss the recently proposed change of adding versioning information to DLPack.
Up until now the DLPack C struct itself does not carry ABI version information, and we have always maintained backward compatibility; there have been no ABI-breaking changes.
That overall stability is one reason why frameworks adopt DLPack in the first place, and we would like that to continue moving forward. In the meantime, it is indeed helpful to clarify the DLPack ABI version in the data structure itself; there is a draft proposal for the change.
In short, we are looking into the possibility of attaching version information to the DLTensor struct; to minimize the ABI change, the version information is appended at the end of the current DLTensor. We also propose a __dlpack_info__ function that returns the currently supported API/ABI version.
One thing worth clarifying is whether this constitutes an ABI-breaking change. It depends on how people use it. Consider the following scenarios:
S0: the consumer calls __dlpack_info__ and then does the conversion accordingly.
S1: the producer provides __dlpack_info__ to indicate that it is already at a new version.
Normally S0 means that the data structure ABI is still backward compatible (the new data structure being used in old scenarios). S1 sits somewhat in the future-compatible regime (the old data structure being used in new scenarios).
This is a notice and discussion thread to let the community know that this is happening. It would be great if folks could also tag related framework communities, so the ecosystem is fully aware of (and ideally buys into) the proposed change before we formally proceed.
Given that this change does have compatibility implications (although not breaking, depending on how we see it), we will have a longer notice period (expect 1-2 months) before we proceed. Please tag as many folks as you think are relevant.
In the future, we also suggest opening threads of this kind to discuss the compatibility implications of changes, if any.
If quad-precision takes off, then the current uint8 for the item size cannot represent the 128 + 128 = 256 bits of a complex quad number. Even now that would fail for most long double storage formats (those cannot be represented right now, so I am not sure it matters).
If an ABI break is necessary in the future, maybe this field should be bumped to uint16 (maybe some others as well)? I guess there will be other ways to work around the limitation, though. And I admit quad-precision complex may well be as bad as it gets and is just at the limit.
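The arithmetic behind the concern, spelled out (the `bits` field of `DLDataType` is a `uint8_t`):

```python
# A uint8 field can hold at most 255:
UINT8_MAX = (1 << 8) - 1
assert UINT8_MAX == 255

# A complex quad-precision number needs 128 + 128 = 256 bits -- one too many:
assert 128 + 128 > UINT8_MAX

# A uint16 field would have ample headroom:
assert 128 + 128 <= (1 << 16) - 1
```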
I'm making this issue here in DLPack, as I cannot think of a better place for it. This issue spans many pieces of software, and since that is exactly the scope of DLPack, this seems a good place. If you know of a better one, tell me.
It happens more and more frequently that multiple pieces of software that need the GPU are used in one script. For optimization reasons, they each implement a memory allocator on top of cudaMalloc.
Problems:
Having many memory allocators in the same experiment causes problems such as memory fragmentation.
Possible solution:
These problems could be solved by having all (or most) software reuse a common allocator.
Do you agree we should spend time on this problem?
Would you agree to use a common allocator if we find a good proposition?
If so, what features do you need for that allocator?
Who else we should contact related to this?
In recent discussions scattered everywhere, it appears that some functions would better be implemented by DLPack itself so that downstream libraries do not have to reinvent the wheel. The possibilities include:
__dlpack__ and __dlpack_device__ in the Python Array API standard for handling streams (#65)
__dlpack_info__ for returning API and ABI versions (and potentially more, see #34, #72)
cc: @tqchen @rgommers @seberg @eric-wieser @kkraus14 @jakirkham @hameerabbasi @vadimkantorov @oleksandr-pavlyk @szha @veritas9872
We do not yet have an example operator interface, so I would propose a dedicated issue for discussing this. Candidate interface postings are welcome.
There are three categories of operators.
The former ones can be relaxed to the latter ones. In general, putting an operator into the most restrictive type leaves the user the chance to decide what to do with it, e.g. levels 0 and 1 allow static memory planning.
Should we put an example somewhere to illustrate that it is trivially easy to borrow NumPy ndarrays as DLManagedTensor without copying the content? I happened to be working on this today.
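A minimal sketch of the zero-copy borrow, assuming NumPy >= 1.22 (which implements the `__dlpack__`/`from_dlpack` protocol; under the hood the producer hands over a "dltensor" capsule that the consumer renames to "used_dltensor"):

```python
import numpy as np

a = np.arange(6, dtype=np.float32).reshape(2, 3)
b = np.from_dlpack(a)   # exports a DLPack capsule and re-imports it

# The round trip borrows the buffer instead of copying it:
assert np.shares_memory(a, b)
assert b.shape == (2, 3)
```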
Hey guys, what do you think about adding support for protobuf?
I have a current need to share/log tensor data via Python.
Wanted to discuss here before starting myself.
I imagine something like this soon in kornia:
from dlpack import dlpack_pb2
import kornia as K
img: K.core.Image = K.io.read_image(...)
# img.to_proto() will convert tensor data (numpy/torch) to `dlpack_pb2`
K.io.save(..., img.to_proto())
img_proto: dlpack_pb2.DLTensor = K.io.load(....)
img_load: K.core.Image = Image.from_proto(img_proto)
To represent tensors of type bool: would it be possible to add another type (kDLBool) to the DLDataTypeCode enum? That would be very useful, as some well-known machine learning platforms (TF, Torch, ONNX Runtime) support models whose inputs/outputs are tensors of type bool. Thanks!
This is an issue for releasing v0.2. The issue will stay open for four days; if there is no objection from participants, a version will be tagged based on the current master.
Dear DLPack authors,
I was curious why several definitions in dlpack.h, specifically various DLTensor attributes, are signed, when negative-valued arguments would seem to indicate obviously nonsensical tensor configurations (such as a negative number of dimensions or a negative shape along a dimension).
Would a PR changing these to their unsigned counterparts be accepted? ABI-wise there should be no impact, as they occupy the same amount of memory (and values using the sign bit would, in any case, not correspond to valid configurations).
Thanks,
Wenzel
Follow-up of #50.
In #58, kDLComplex was added to support "array of structs" based complex numbers, which is the compact memory layout used in C/C++/Python/etc. However, as pointed out in #50, there is a need to support the "struct of arrays" layout as well (a struct containing two pointers, for real and imag). This issue tracks that need.
The DLTensor structure has been stable for a while, and since one major reason for DLPack is to be used across frameworks, I would recommend we tag releases for major ABI versions.
The first release will only cover the stable DLTensor structure.
As a follow-on to issue #34, I propose adding a description of the underlying data's tensor format to the DLTensor struct.
This option would consume a single byte with one of two possible values: kDLRowMajor or kDLColumnMajor.
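For context, the two layouts are already distinguishable from `strides` (shown here in element units, with NumPy standing in as a producer); the proposed byte would make the intent explicit rather than inferred:

```python
import numpy as np

c_order = np.zeros((2, 3), dtype=np.float32, order="C")  # row-major
f_order = np.zeros((2, 3), dtype=np.float32, order="F")  # column-major

# Convert NumPy's byte strides to DLPack-style element strides:
def elem_strides(arr):
    return tuple(s // arr.itemsize for s in arr.strides)

assert elem_strides(c_order) == (3, 1)   # row-major: last axis is contiguous
assert elem_strides(f_order) == (1, 2)   # column-major: first axis is contiguous
```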
Add a new device type to dlpack.h. The device name will be kDLAxelera, added to the DLDeviceType enum.
I have a couple of questions about what might be useful to include or modify.
What is the purpose of lanes? Lanes seem like a way to describe access to the data (i.e. alignment) and not the data itself. But since it affects the shape, it is very limiting to use lanes != 1 (none of the Py-libs do or support it, I think)? Unless lanes don't affect shape/strides, in which case they could convey other information.
How about adding alignment=256 or pow2_alignment=8? An alignment larger than the item size could indicate that it is valid to do vectorized reads of that size (which may require the use of byte_offset if the first read starts at an offset?).
The current stream exchange allows the consumer to synchronize with the producer at time of consumption. If all the consumer wants to do is a computation like:
@compile_for_dlpack
def update_simulation(dlpack_array):
data = dlpack_array.__dlpack__(stream=s2) # synchronizes with "s2"
# launch work on s2
return # returns without synch of original stream.
# user code:
arr = MyArray()
for i in range(1000):
update_simulation(arr)
do_analysis(arr) # cannot be auto-synch'ed
cannot guarantee that do_analysis waits, unless update_simulation does a full synchronization? (Maybe this is just not important?)
Is stream lifetime management difficult? I can see mylib.from_dlpack(arr) synchronizing only once (although it might be nicer to be safer by default, but I don't know). But the computational-library use case seems more relevant?
(I have read the thread about introducing the stream= API, but it is not really clear to me why the current scheme is much simpler.)
It seems to be undefined behavior in our protocol whether the deleter below deletes self as well. The inconsistency caused a recent crash when using MXNet as a backend for DGL, reported by @zheng-da.
dlpack/include/dlpack/dlpack.h
Lines 163 to 167 in 5c792ce
I did a quick check and found that the deleters in TVM and DGL do free self, but I haven't verified the PyTorch and Chainer side. Could someone help with this issue? Thanks!
This RFC proposes to rename kDLGPU to kDLCUDA, and kDLCPUPinned to kDLCUDAHost. Two main reasons for this renaming: these device types are in fact CUDA-specific, and frameworks such as PyTorch use naming like torch.cuda for CUDA tensor types, so the renaming will make DLPack more consistent with the other frameworks. Look forward to hearing your thoughts!
Following up on #57 where we figured out the correct stream exchange and synchronization semantics and a Python interface for doing so, we need to do the same for a C interface.
TLDR from #57:
cc @tqchen @harrism @jrhemstad @rgommers @leofang @oleksandr-pavlyk @szha @veritas9872
So far DLPack has introduced a few primitive data types: int, float, uint and bfloat. However, we still want to be able to provide a type code for quick extension in a compatible way.
One potential proposal is to introduce an OpaqueHandle type code, which means the target data is actually opaque, while the bit width and lanes are still specified. This type code could be used for testing data types that are not yet supported, and would allow frameworks to exchange the data as long as they agree on the dtype.
The contrib folder contains headers that are specified as an interface of the dlpack library (https://github.com/dmlc/dlpack/blob/master/CMakeLists.txt#L66) but are not installed to the appropriate folder during the install step: https://github.com/dmlc/dlpack/blob/master/CMakeLists.txt#L113
DLPack has this comment in the header file:
/*!
* \brief The data pointer points to the allocated data. This will be CUDA
* device pointer or cl_mem handle in OpenCL. It may be opaque on some device
* types. This pointer is always aligned to 256 bytes as in CUDA. The
* `byte_offset` field should be used to point to the beginning of the data.
*
* Note that as of Nov 2021, multiply libraries (CuPy, PyTorch, TensorFlow,
* TVM, perhaps others) do not adhere to this 256 byte aligment requirement
* on CPU/CUDA/ROCm, and always use `byte_offset=0`. This must be fixed
* (after which this note will be updated); at the moment it is recommended
* to not rely on the data pointer being correctly aligned.
* ...
*/
This was discussed in data-apis/array-api#293 and came up in NumPy numpy/numpy#20338.
This comment by @rgommers summarizes the issue very well. Quoting the options in the comment:
These are the options:
A1: required alignment. Require the data pointer to always be aligned (using nonzero byte_offset), and do the gradual evolution plan in my comment above.
A2: no alignment. Remove the allocation requirement completely from dlpack.h. No library needs to make any changes (except if current handling of byte_offset is buggy, like @seberg pointed out for PyTorch). NumPy and other new implementers then just use byte_offset=0 always (easiest), and we're done.
A3: optional alignment. Do not require alignment, but add a way to communicate from the producer to the consumer what the alignment of the data is.
The current status is that the fine print in dlpack.h requires alignment (option A1), but no one adheres to it or enforces it. This state is not very useful: it requires a >1 year evolution plan, and apparently there's no gain because of the third bullet above. So it looks like the best choices are either A2 or A3. A3 seems strictly better than A2, and most of the work it requires (versioning/extensibility) is work we wanted to do for other reasons already. So here's a new proposal:
Decide that the long-term desired state is A3: optional alignment.
NumPy and other new implementers do whatever is simplest, i.e. use byte_offset = 0 and data pointing to the first element in memory.
Update the comment in dlpack.h about this topic to reflect: current state, desired future state, and a link to a new issue on the DLPack repo with more info (outcome of this discussion to be summarized on that issue).
I agree with @rgommers that A3 is the best option because most libraries don't care about alignment, and we can communicate the alignment to those that do using the __dlpack_info__ dunder or a request API under discussion in #34. If others agree, then let's add that to the spec and remove the comment from the header.
I noticed TensorFlow 2.8.0 crashed with NumPy 1.22.3 (on Ubuntu 20.04) due to an alignment issue:
import numpy, tensorflow
x = numpy.arange(5, dtype='int64')
tf_x = tensorflow.experimental.dlpack.from_dlpack(x.__dlpack__())
tf_x[1] # Fatal Failure
# 2022-03-15 18:29:10.587129: F ./tensorflow/core/framework/tensor.h:776] Check failed: IsAligned()
# Aborted
This happens for float16
, float32
, and int64
.
(This crash doesn't happen with TensorFlow 2.7.0 and NumPy 1.22.3, but it does for all other version combinations.)
So, it'd be good to communicate alignment to the importing library. I don't think it'd be difficult to calculate on the importer side, but communicating it could be useful for opaque pointers.
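For what it's worth, computing the alignment on the importer side is a one-liner once the raw address is known (NumPy used here only for illustration; this obviously doesn't work for opaque device pointers):

```python
import numpy as np

x = np.arange(5, dtype=np.int64)
addr = x.__array_interface__["data"][0]   # raw data pointer as an int

# Largest power of two dividing the address = the pointer's alignment:
alignment = addr & -addr

# A freshly allocated int64 buffer is at least 8-byte aligned:
assert alignment >= 8
assert alignment & (alignment - 1) == 0   # always a power of two
```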
The DLDeviceType enum enumerates the list of device types.
The TVM project defines another enumeration, TVMDeviceExtType, that provides a supplemental set of devices/enumerators. It's important that there's no overlap between the integers provided by DLDeviceType and TVMDeviceExtType.
Unfortunately, there's currently no good mechanism to notice when changes to either project lead to both using the same integer value in those enumerations.
We could address this by adding a sentinel value to DLDeviceType
, e.g.:
typedef enum {
kDLDeviceType_Begin = 1,
kDLCPU = kDLDeviceType_Begin,
...
kDLWebGPU = 15,
/*! \brief Qualcomm Hexagon DSP */
kDLHexagon = 16,
kDLDeviceType_End, // all DLDeviceType enumerators are guaranteed to be numerically lower than this integer
} DLDeviceType;
With this in place, TVM could safely avoid problems using something like this:
typedef enum {
kDLAOCL = kDLDeviceType_End,
kDLSDAccel,
kOpenGL,
kDLMicroDev,
kDLWebGPU,
// AddExtraTVMType which is not in DLPack here
} TVMDeviceExtType;
or this:
typedef enum {
kDLAOCL = ...,
...
} TVMDeviceExtType;
// Relies on kDLAOCL having the lowest integer value in TVMDeviceExtType.
static_assert(kDLAOCL >= kDLDeviceType_End);
DLPack's vision of a common in-memory tensor format that spans device and memory types is fantastic.
However, in its current form there is no upgrade path for adding new items to either the DLTensor or DLManagedTensor structs in a way that would maintain ABI compatibility.
I would like to propose the addition of two components to the DLTensor struct. This will break current ABI compatibility, but will future-proof the design in the long run.
Additions to DLTensor:
uint8/16_t version
uint64_t future_bytes
Adding the version allows the receiving library to determine whether the DLTensor can be consumed. The receiver may not have a matching version, but as long as it knows the version it can decide whether the data can be used correctly.
Adding future_bytes allows for the addition of new options to DLTensor. One of these might be data layout, i.e. row-major vs. column-major (C format vs. FORTRAN). I will open a separate issue for this feature.
This is the root discussion issue for the proposal. I have given most related folks write access to the repo, but let us work through PRs so that changes can be reviewed and discussed. Feel free to propose changes.
Please consider releasing version 0.7 with kDLOneAPI
added to unblock pytorch/pytorch#78154
Complaints about endianness have been something I've recurrently seen (e.g. CuPy cupy/cupy#3652 and mpi4py mpi4py/mpi4py#177), and I anticipate at some point we'd start receiving bug reports on this. Apparently there are at least a few communities out there (astropy and hdf5) that prefer (or could work with) non-native (that is, big-endian) data. This causes problems if two libraries exchange data but do not communicate the endianness needed to interpret it.
I suggest two possible solutions:
Add endianness to DLDataType as a new struct member.
Extend DLDataType::code to make it carry this information.
Migrated from apache/mxnet#4735.
Hi, I have recently proposed an adoption of DLPack; however, I am not sure my arguments are good.
It would be nice to have a list of advantages for frameworks/libraries that adopt DLPack.
So far I can see:
a standard exchange structure (DLManagedTensor)
Something else?
From #67 (comment) and #67 (comment):
I am proposing to add two new device types, following the line of #67:
kDLROCMHost
kDLCUDAManaged
The first addition mirrors the current relation (since v0.5) between CUDA and ROCm, now that we have kDLCUDA, kDLCUDAHost, and kDLROCM. ROCm also provides pinned/page-locked memory, so this is legitimate.
The second addition is for CUDA managed/unified memory, which does not belong to either host or device but to both. It seems natural to me to have a standalone type for it. ROCm currently does not provide managed memory, so we could add it in the future once AMD implements it.
Both additions seem straightforward for me to add without any issue, as they are orthogonal to existing device types (as they should be).
cc: @rgommers @tqchen @jakirkham @kkraus14 @kmaehashi @emcastillo @asi1024