Comments (10)
Implemented in #228.
from dora.
Thanks for this detailed analysis!
Curious about when you say:
> Unfortunately there is no way to figure out whether the data is no longer used.

This is because we can't listen on the Python Arrow array itself, right? This is why you mentioned that we could eventually listen for a drop of the event object itself (with a one-shot Tokio channel, for example).
In the case we listen for a drop of the event object, wouldn't we be able to know when the data is no longer used?
from dora.
> Unfortunately, most of these types are unsuitable for our use-case, since they assume that the data is stored on the heap

I have to check again, but from what I remember, the unsafe slice gives you an `Array<u8>` that you can transmute for free to other types like `array<f64>` and the like.
from dora.
> In the case we listen for a drop of the event object, wouldn't we be able to know when the data is no longer used?

Dropping the event is not enough, since the user can take the data out of input events and store it somewhere else (e.g. in some list). So we really have to wait until the data is dropped too. With Arrow, this is signaled through the `release` callback, which is part of the data format.
The issue with the `arrow2` crate is that it does not support plugging your own logic into the `release` callback.
> > Unfortunately, most of these types are unsuitable for our use-case, since they assume that the data is stored on the heap
>
> I have to check again, but from what I remember, the unsafe slice gives you an `Array<u8>` that you can transmute for free to other types like `array<f64>` and the like.

The issue with the `ffi::mmap::slice` method is that it does not take any ownership of the data. So it will set the `release` callback to a no-op, which effectively requires us to keep the data allocated forever (since there is no way to find out whether it's safe to free it).
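To illustrate why dropping the event is not enough, here is a minimal std-only sketch. It does not use Arrow at all: `Region`, `Event`, `new_event`, and `demo` are hypothetical names, and the `Drop` implementation plays the role that a `release` callback would play for real Arrow data. The notification only fires once the last reference to the data is gone, not when the event is dropped.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Arc;

/// Hypothetical stand-in for a memory region that holds input data.
struct Region {
    bytes: Vec<u8>,
    /// Notified once, when the last reference to the region goes away.
    on_release: Sender<()>,
}

impl Drop for Region {
    fn drop(&mut self) {
        // Ignore the error if nobody is listening any more.
        let _ = self.on_release.send(());
    }
}

/// Hypothetical input event that hands out reference-counted data.
struct Event {
    data: Arc<Region>,
}

fn new_event(bytes: Vec<u8>) -> (Event, Receiver<()>) {
    let (tx, rx) = channel();
    let region = Region { bytes, on_release: tx };
    (Event { data: Arc::new(region) }, rx)
}

/// Returns (released after the event was dropped, released after the data was dropped).
fn demo() -> (bool, bool) {
    let (event, released) = new_event(vec![1, 2, 3]);

    // The user takes the data out of the event and stores it in a list.
    let mut kept: Vec<Arc<Region>> = vec![Arc::clone(&event.data)];

    drop(event);
    let after_event_drop = released.try_recv().is_ok();

    // The data is still alive and readable at this point.
    assert_eq!(kept[0].bytes, [1, 2, 3]);

    kept.clear();
    let after_data_drop = released.try_recv().is_ok();

    (after_event_drop, after_data_drop)
}
```

Listening only for the drop of `Event` would report the data as unused while `kept` still points at it, which is the situation described above.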
from dora.
I see your point. Thanks for the clarification! I'll try to run some tests as well next week.
from dora.
To give some more details:
- The underlying type for all the arrow array implementations is called `Buffer` in both `arrow2` and `arrow`
  - For both crates, this `Buffer` type has an `Arc` field of type `Bytes`, which stores the actual data
  - For the `arrow` crate, `Bytes` is a custom struct, with a pointer, a length, and a `Deallocation` field that defines how to clean up the array.
  - For `arrow2`, this `Bytes` type is an alias for a `ForeignVec` with the owner set to `InternalArrowArray`, which contains a reference to the actual FFI-compatible arrow array type.
- We want to set a custom deallocation logic:
  - For `arrow`: The nice thing is that the `Deallocation` parameter can be set to some custom deallocation logic through the `owner` argument of `Buffer::from_custom_allocation`
  - For `arrow2` we don't have such an option. However, the `PrivateData` struct (which the library frees on drop) has a generic `array: Box<dyn Array>` field. So maybe we could implement the `Array` trait for a custom type of ours, which performs the necessary drop steps? Unfortunately, this does not work either because the library assumes that the `Array` trait can always be downcasted to some known array type. So using e.g. `export_array_to_c` with a custom `Array` implementation will result in a panic.
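The "owner" idea can be sketched without either crate. The following is a simplified model, not the real `arrow` types: `BorrowingBuffer`, `with_owner`, `RegionOwner`, and `demo_owner` are hypothetical names. The buffer only stores a pointer and a length, and keeps an `Arc` to an owner whose `Drop` implementation carries the custom deallocation logic.

```rust
use std::any::Any;
use std::ptr::NonNull;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Simplified model of a buffer that does not allocate its own data.
/// It keeps an owner alive and lets the owner's `Drop` do the cleanup.
#[derive(Clone)]
struct BorrowingBuffer {
    ptr: NonNull<u8>,
    len: usize,
    _owner: Arc<dyn Any>,
}

impl BorrowingBuffer {
    /// Safety: `ptr` must point to `len` readable bytes that stay valid
    /// until `owner` is dropped.
    unsafe fn with_owner(ptr: NonNull<u8>, len: usize, owner: Arc<dyn Any>) -> Self {
        Self { ptr, len, _owner: owner }
    }

    fn as_slice(&self) -> &[u8] {
        // Valid because the owner (and so the memory) outlives `self`.
        unsafe { std::slice::from_raw_parts(self.ptr.as_ptr(), self.len) }
    }
}

/// Hypothetical owner of a memory region. A real one might unmap shared
/// memory in `drop`; this one only records that cleanup ran.
struct RegionOwner {
    storage: Vec<u8>,
    freed: Arc<AtomicBool>,
}

impl Drop for RegionOwner {
    fn drop(&mut self) {
        self.freed.store(true, Ordering::SeqCst);
    }
}

/// Returns (buffer contents, freed while a copy was alive, freed at the end).
fn demo_owner() -> (Vec<u8>, bool, bool) {
    let freed = Arc::new(AtomicBool::new(false));
    let mut owner = RegionOwner { storage: vec![7, 8, 9], freed: Arc::clone(&freed) };
    let ptr = NonNull::new(owner.storage.as_mut_ptr()).expect("Vec pointers are non-null");
    let len = owner.storage.len();

    // Moving `owner` into the `Arc` does not move the Vec's heap storage.
    let buffer = unsafe { BorrowingBuffer::with_owner(ptr, len, Arc::new(owner)) };

    let copy = buffer.clone();
    drop(buffer);
    let freed_while_copy_alive = freed.load(Ordering::SeqCst);

    let contents = copy.as_slice().to_vec();
    drop(copy);
    (contents, freed_while_copy_alive, freed.load(Ordering::SeqCst))
}
```

The cleanup in `RegionOwner::drop` only runs after the last clone of the buffer is gone, which is the behaviour we would want from a custom deallocation hook.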
from dora.
Could you clarify why you chose the `arrow2` crate instead of `arrow` in the first place? Is there any functionality missing from the `arrow` crate?
Also, it looks like there are some proposals for merging `arrow2` and `arrow`: jorgecarleitao/arrow2#1429 and apache/arrow-rs#1176.
from dora.
Yeah, I thought that `arrow/Buffer` required ownership of the data, but as you mentioned, it seems you can do it with [from_custom_allocation](https://docs.rs/arrow/35.0.0/arrow/buffer/struct.Buffer.html#method.from_custom_allocation).
In many ways the arrow2 crate seemed easier to work with for `vec` and `slice`. But if this deallocation can only be done with `arrow`, there is no issue with using it.
from dora.
In any case, it should be simple to change from arrow2 to arrow, as they can both read C pointers as input to make an array. I can add some comments on how to go from one to the other if you need.
Also, isn't the deallocation method of an array that we have built called at the end of its lifetime, which in our case is when we export it to a Python Arrow array, and not when it's no longer used within Python?
from dora.
> In any case, it should be simple to change from arrow2 to arrow, as they can both read C pointers as input to make an array. I can add some comments on how to go from one to the other if you need.

Yeah, I think it shouldn't be too difficult to switch from `arrow2` to `arrow`. The challenge will probably be to set up the `Deallocation` field correctly. I'll try to look into it today.
> Also, isn't the deallocation method of an array that we have built called at the end of its lifetime, which in our case is when we export it to a Python Arrow array, and not when it's no longer used within Python?

This depends on whether the created arrow array only borrows the data or whether it takes ownership. The `arrow2::ffi::mmap::slice` function only borrows the data, so the original array is dropped at the end of the scope as usual (which caused a segfault in the Python code). The `arrow::buffer::Buffer::from_custom_allocation` function takes ownership of the array instead, so the drop will happen once the last copy of the data is dropped (reference-counted using `Arc`).
When exporting the array to C through the `FFI_ArrowArray::new` function, the reference count is increased by one. The idea is that the Python code will invoke the `release` callback once it's done with the data, which decreases the reference count and drops the data once the reference count reaches 0. (We might need an additional `mem::forget` when doing the conversion, as it looks like the `FFI_ArrowArray` has a `Drop` implementation that calls `release` itself.)
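As a rough, std-only sketch of that reference-counting flow: `ExportedArray` below is a hypothetical two-field struct, not the real `FFI_ArrowArray` layout, and `export` and `demo_export` are made-up helper names. `Arc::into_raw` keeps the count incremented until the consumer calls `release`. Since this toy struct has no `Drop` implementation, no `mem::forget` is needed here.

```rust
use std::ffi::c_void;
use std::sync::Arc;

/// Minimal stand-in for an FFI array struct (NOT the real `FFI_ArrowArray`).
#[repr(C)]
struct ExportedArray {
    private_data: *mut c_void,
    release: Option<unsafe extern "C" fn(*mut ExportedArray)>,
}

/// Called by the consumer (e.g. Python) once it is done with the data.
unsafe extern "C" fn release(array: *mut ExportedArray) {
    unsafe {
        let array = &mut *array;
        // Reclaim the reference leaked by `export`; dropping it decrements the count.
        drop(Arc::from_raw(array.private_data as *const Vec<u8>));
        array.private_data = std::ptr::null_mut();
        // The Arrow C data interface marks a released struct by clearing `release`.
        array.release = None;
    }
}

fn export(data: &Arc<Vec<u8>>) -> ExportedArray {
    ExportedArray {
        // `into_raw` keeps the count incremented without running `Drop`.
        private_data: Arc::into_raw(Arc::clone(data)) as *mut c_void,
        release: Some(release),
    }
}

/// Returns (count while exported, count after release, whether `release` was cleared).
fn demo_export() -> (usize, usize, bool) {
    let data = Arc::new(vec![1u8, 2, 3]);
    let mut exported = export(&data);
    let while_exported = Arc::strong_count(&data);

    // The consumer signals that it no longer needs the data.
    let callback = exported.release.expect("array was already released");
    unsafe { callback(&mut exported) };

    (while_exported, Arc::strong_count(&data), exported.release.is_none())
}
```

If the Rust side had already dropped its own `Arc` before the callback ran, the `release` call would be the one that brings the count to 0 and frees the data.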
from dora.