Currently any `TraceAgent`, the thing you hold on to to

Remove `distinguish_since` capability

timelydataflow commented on July 30, 2024

Remove `distinguish_since` capability

from differential-dataflow.

Comments (8)

utaal commented on July 30, 2024

If we were to keep them both around, what about merge_batches_until and distinguish_(timestamps_)since?

Other options for advance_by:

- keep_history_since;
- advance_timestamps_until/to.

frankmcsherry commented on July 30, 2024

I started on these edits, and one complication shows up (at least one):

It is currently possible to have an Arranged source of data, meaning a stream of batches and an indexed trace, where all of the trace handles have been dropped and so the trace is no longer maintained. An example of this is in the join operator, which wants each input arranged but only needs to maintain the history if the opposite stream is still active. If an opposite stream is terminated (corresponding to now immutable data), the first stream can cease its trace maintenance (it should still send arranged batches along, but needn't maintain (and merge) a list of historical batches).

This worked in part because access to the trace was mediated through the trace handle, and by dropping it there was no other route for trace data to arrive. If instead we ship (Batch, Vec<Batch>) data along channels, ... we need to be clear about what this all means.

There is nothing fundamentally hard about doing this, in that we can use the same signal (trace handle drop) to clean up the trace, at which point we have no batches to fill up a Vec<Batch>. That is fine as long as the recipient understands that, the trace handle having been dropped, these are no longer valid history (i.e., an empty list does not indicate an empty history, just an absent history).

This doesn't seem hard to address "by convention", but it does signal that we are probably planning on accessing trace data without the help of the trace. We could instead ship (Batch, Option<Trace::Cursor::Storage>), which would still require the trace's help to get a cursor, and only allow navigation if we have both received a Some(_) variant and hold a trace handle from which to grab a cursor.
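To make the "empty history vs. absent history" distinction concrete, here is a minimal sketch using only the standard library. None of these types are the differential-dataflow ones: `Batch`, `History`, and `ToyTrace` are hypothetical stand-ins, times are plain `u64`, and "dropping the trace handles" is modeled by a single method call. It only illustrates why an explicit "absent" signal is less ambiguous than an empty `Vec<Batch>`.

```rust
/// Hypothetical stand-in for a batch of updates covering times in [lower, upper).
#[derive(Clone, Debug, PartialEq)]
struct Batch {
    lower: u64,
    upper: u64,
}

/// What would travel along the channel next to each new batch.
#[derive(Clone, Debug, PartialEq)]
enum History {
    /// The trace is still maintained: these are the batches that precede the
    /// new one. An empty list here really does mean "empty history".
    Maintained(Vec<Batch>),
    /// Every trace handle was dropped, so no history is available at all.
    Absent,
}

/// Toy trace: `batches` becomes `None` once all handles have been dropped.
struct ToyTrace {
    batches: Option<Vec<Batch>>,
}

impl ToyTrace {
    fn new() -> Self {
        ToyTrace { batches: Some(Vec::new()) }
    }

    /// Models the last trace handle being dropped: stop maintaining history.
    fn drop_handles(&mut self) {
        self.batches = None;
    }

    /// Accepts a new batch and returns what would be shipped downstream.
    fn publish(&mut self, batch: Batch) -> (Batch, History) {
        let history = match &mut self.batches {
            Some(list) => {
                // Snapshot the prior batches, then record the new one.
                let prior = list.clone();
                list.push(batch.clone());
                History::Maintained(prior)
            }
            None => History::Absent,
        };
        (batch, history)
    }
}
```

A recipient that matches on `History` cannot mistake `Absent` for an empty but valid history, which is the ambiguity a bare `Vec<Batch>` would have to resolve by convention.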

utaal commented on July 30, 2024

To have time to serialise batches for durability purposes (unless this is done at batch construction time, somehow), it may be helpful to keep the batches in advance of (the old) advance_by from being merged (while they're being written down): this way one could trivially write down all the new batches one by one, and just load them all as non-merged batches on recovery.

I may need a clarification: with the current (and proposed) interface, merging two batches a and b may result in a merged batch c with c.since > a.since and c.since > b.since, if the advance_by frontier is in advance of both a.upper and b.upper? (where since and upper are as defined in Description)

frankmcsherry commented on July 30, 2024

I was guessing that serialization would probably be done at batch creation time, but I see your point if not. I have some other reservations about the distinguish_since API and the proposed fix, so let's think out what we would want from it.

I like a "bookmark" sort of feature, where you can hold on to a capability for a batch boundary (e.g. a lower or an upper) that prevents merging across that boundary. I think that this does something like what serialization would want (e.g. "I've serialized up to upper; please don't merge across that, but feel free to merge among batches after it"). Does that sound believable to you?

Edit: Alternately / similarly, the capability might not prevent merging but ensure that you can get access to the trace split at this point (e.g. when we merge across, the source batches aren't dropped as the capability could hold a reference to them). This made more sense when the capability was for "batches up to here", and maybe makes less sense for "batches after here, cleanly separated".
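The first variant of the "bookmark" idea (a held boundary that merging may not cross) can be sketched as follows. This is a toy model under stated assumptions, not the library's design: `Batch`, `ToyTrace`, `bookmark`, `release`, and `merge_all` are hypothetical names, times are totally ordered `u64` values, and a "capability" is reduced to an entry in a set of protected boundaries.

```rust
use std::collections::BTreeSet;

/// Hypothetical batch description covering times in [lower, upper).
#[derive(Clone, Debug, PartialEq)]
struct Batch {
    lower: u64,
    upper: u64,
}

/// Toy trace: a contiguous, time-ordered list of batches plus the set of
/// boundaries that some holder has asked us not to merge across.
struct ToyTrace {
    batches: Vec<Batch>,
    bookmarks: BTreeSet<u64>,
}

impl ToyTrace {
    fn new(batches: Vec<Batch>) -> Self {
        ToyTrace { batches, bookmarks: BTreeSet::new() }
    }

    /// Acquire a bookmark: "please don't merge across this boundary".
    fn bookmark(&mut self, boundary: u64) {
        self.bookmarks.insert(boundary);
    }

    /// Release a bookmark, allowing merges across the boundary again.
    fn release(&mut self, boundary: u64) {
        self.bookmarks.remove(&boundary);
    }

    /// Merge every pair of adjacent batches whose shared boundary is not
    /// bookmarked. Batches on either side of a bookmark still merge freely
    /// among themselves.
    fn merge_all(&mut self) {
        let old = std::mem::take(&mut self.batches);
        let mut merged: Vec<Batch> = Vec::new();
        for batch in old {
            let mergeable = match merged.last() {
                Some(prev) => {
                    prev.upper == batch.lower && !self.bookmarks.contains(&prev.upper)
                }
                None => false,
            };
            if mergeable {
                if let Some(prev) = merged.last_mut() {
                    prev.upper = batch.upper;
                }
            } else {
                merged.push(batch);
            }
        }
        self.batches = merged;
    }
}
```

In this model a serializer that has written everything up to time 2 would hold `bookmark(2)`: batches before and after 2 keep merging, and only the boundary itself stays intact until `release(2)`.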

utaal commented on July 30, 2024

> I like a "bookmark" sort of feature, where you can hold on to a capability for a batch boundary (e.g. a lower or an upper) that prevents merging across that boundary. I think that this does something like what serialization would want (e.g. "I've serialized up to upper; please don't merge across that, but feel free to merge among batches after it"). Does that sound believable to you?

Would this also ensure that since ≤ upper? I.e. prevent merging and timestamp advancement across that boundary?

frankmcsherry commented on July 30, 2024

> I may need a clarification: with the current (and proposed) interface, merging two batches a and b may result in a merged batch c with c.since > a.since and c.since > b.since?

I think that when you merge two batches, you need to advance the since bound to not make any promises that the merged data cannot provide. What probably happens at the moment (and the code tests for) is that one of a.since and b.since should be in advance of the other (because the trace's frontier used for them only advances), and we pick the more advanced one. Otherwise, we would need to pick a frontier in advance of both of them.
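As a rough illustration of the rule described above, here is a sketch with totally ordered `u64` times, where "the more advanced of the two" is simply the maximum. The `Description` struct and `merge` function are hypothetical and only mirror the lower/upper/since vocabulary of this thread. With partially ordered times, since would be a frontier rather than a single number, and when neither input is in advance of the other one would need a frontier in advance of both.

```rust
/// Hypothetical batch description with totally ordered `u64` times.
/// `lower`/`upper` bound the times the batch covers; `since` is the time
/// from which the batch's contents are still distinguishable.
#[derive(Clone, Debug, PartialEq)]
struct Description {
    lower: u64,
    upper: u64,
    since: u64,
}

/// Merge the descriptions of two adjacent batches (`a` directly before `b`).
/// Returns `None` if the batches are not adjacent.
fn merge(a: &Description, b: &Description) -> Option<Description> {
    if a.upper != b.lower {
        return None;
    }
    Some(Description {
        lower: a.lower,
        upper: b.upper,
        // The merged batch can promise no more than its least precise
        // input, so it takes the more advanced of the two `since` bounds.
        since: a.since.max(b.since),
    })
}
```

For example, merging a batch with since 2 and an adjacent batch with since 4 yields since 4: the merged result cannot claim to distinguish times that the second input had already advanced.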

frankmcsherry commented on July 30, 2024

> Would this also ensure that since ≤ upper? I.e. prevent merging and timestamp advancement across that boundary?

No, that was not the intent. If you want to prevent advancement, you can indicate that with a trace handle by not advancing its advance_by (gah, the names...).

Perhaps I should try and clear up an unspoken desideratum, which is that holding these capabilities shouldn't cripple the other users of the trace by preventing e.g. merging, but perhaps also advancement. Most of the requirements have been historical ("I need access to the history, with some properties"), whereas the durability work seems like it might be a future constraint ("I will need history from here held separate and not advanced").

The poster-child for bad behavior at the moment is in the server project, where the trace handle for the random graph stream downgrades none of its capabilities, preventing merging and advancement. I can understand not downgrading advancement, if you want full historical detail, but not downgrading merging seems like just a performance loss there. On the other hand, we do want to be able to hook the batches for durability somehow.

utaal commented on July 30, 2024

> whereas the durability work seems like it might be a future constraint. "I will need history from here held separate and not advanced."

Sounds right to me, right now, but there's a chance this can be relaxed.

> I can understand not downgrading advancement, if you want full historical detail, but not downgrading merging seems like just a performance loss there.

I see, makes sense.
