Comments (8)
If we were to keep them both around, what about merge_batches_until and distinguish_(timestamps_)since?
Other options for advance_by: keep_history_since; advance_timestamps_until/to.
from differential-dataflow.
I started on these edits, and one complication shows up (at least one):
It is currently possible to have an Arranged source of data, meaning a stream of batches and an indexed trace, where all of the trace handles have been dropped and so the trace is no longer maintained. An example of this is in the join operator, which wants each input arranged but only needs to maintain the history if the opposite stream is still active. If an opposite stream is terminated (corresponding to now-immutable data), the first stream can cease its trace maintenance (it should still send arranged batches along, but needn't maintain (and merge) a list of historical batches).
This worked in part because access to the trace was mediated through the trace handle, and by dropping it there was no other route for trace data to arrive. If instead we ship (Batch, Vec<Batch>) data along channels, ... we need to be clear about what this all means.
There is nothing fundamentally hard about doing this, in that we can use the same signals (trace handle drop) to clean up the trace, at which point we have no batches to fill up a Vec<Batch>, which .. is fine as long as the recipient understands that, once the trace handle has been dropped, these are no longer valid history (i.e., an empty list does not indicate an empty history, just an absent history).
This doesn't seem hard to address "by convention", but it does signal that we are probably planning on accessing trace data without using the help of the trace. We could instead ship (Batch, Option<Trace::Cursor::Storage>), which would still require the trace's help to get a cursor, and only allow navigation if we have both received a Some(_) variant and have a trace handle from which to grab a cursor.
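The "absent history is not empty history" convention can also be made explicit in the type rather than left to convention. Below is a minimal sketch under stated assumptions: `Batch`, `History`, `Message`, and `navigable` are hypothetical stand-ins invented for illustration and are not types from differential-dataflow.

```rust
/// A stand-in for a batch of updates; a real batch carries far more than its bounds.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub lower: u64,
    pub upper: u64,
}

/// The history that accompanies a newly produced batch (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum History {
    /// All trace handles were dropped, so no history is maintained.
    Absent,
    /// The trace is live; these are its historical batches, possibly none yet.
    Present(Vec<Batch>),
}

/// What would travel along the channel: the new batch plus whatever history exists.
pub type Message = (Batch, History);

/// Returns the historical batches a recipient may navigate,
/// or `None` when the history is absent rather than merely empty.
pub fn navigable(message: &Message) -> Option<&[Batch]> {
    match &message.1 {
        History::Absent => None,
        History::Present(batches) => Some(batches.as_slice()),
    }
}
```

With this shape, a recipient that sees `History::Present(vec![])` knows the history is genuinely empty, while `History::Absent` tells it that no conclusion about the past can be drawn.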
from differential-dataflow.
To have time to serialise batches for durability purposes (unless this is done at batch construction time, somehow), it may be helpful to keep the batches in advance of (the old) advance_by from being merged (while they're being written down): this way one could trivially write down all the new batches one by one, and just load them all as non-merged batches on recovery.
I may need a clarification: with the current (and proposed) interface, merging two batches a and b may result in a merged batch c with c.since > a.since and c.since > b.since, if the advance_by frontier is in advance of both a.upper and b.upper? (where since, upper are as defined in Description)
from differential-dataflow.
I was guessing that serialization would probably be done at batch creation time, but I see your point if not. I have some other reservations about the distinguish_since API and the proposed fix, so let's think out what we would want from it.
I like a "bookmark" sort of feature, where you can hold on to a capability for a batch boundary (e.g. a lower or an upper) that prevents merging across that boundary. I think that this does something like what serialization would want (e.g. "I've serialized up to upper; please don't merge across that, but feel free to merge among batches after it"). Does that sound believable to you?
Edit: Alternately / similarly, the capability might not prevent merging but ensure that you can get access to the trace split at this point (e.g. when we merge across, the source batches aren't dropped as the capability could hold a reference to them). This made more sense when the capability was for "batches up to here", and maybe makes less sense for "batches after here, cleanly separated".
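To make the first "bookmark" variant concrete, here is a rough sketch, assuming totally ordered `u64` timestamps and batches described only by their `[lower, upper)` bounds. `Interval` and `merge_respecting_bookmarks` are hypothetical names, and the greedy merge is a simplification that ignores how a real trace chooses which batches to merge.

```rust
use std::collections::BTreeSet;

/// A batch reduced to its time interval `[lower, upper)` (hypothetical stand-in).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Interval {
    pub lower: u64,
    pub upper: u64,
}

/// Greedily merges adjacent batches, but never across a bookmarked boundary.
/// `batches` is assumed to be sorted and contiguous, as in a trace's batch list.
pub fn merge_respecting_bookmarks(
    batches: &[Interval],
    bookmarks: &BTreeSet<u64>,
) -> Vec<Interval> {
    let mut merged: Vec<Interval> = Vec::new();
    for &batch in batches {
        // The previous batch can absorb this one only if the two are adjacent
        // and nobody holds a bookmark on their shared boundary.
        let can_extend = match merged.last() {
            Some(last) => last.upper == batch.lower && !bookmarks.contains(&last.upper),
            None => false,
        };
        if can_extend {
            if let Some(last) = merged.last_mut() {
                last.upper = batch.upper;
            }
        } else {
            merged.push(batch);
        }
    }
    merged
}
```

A serializer that has written everything up to time 2 would hold a bookmark at 2: batches on either side remain free to merge among themselves, but no merged batch spans the boundary.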
from differential-dataflow.
> I like a "bookmark" sort of feature, where you can hold on to a capability for a batch boundary (e.g. a lower or an upper) that prevents merging across that boundary. I think that this does something like what serialization would want (e.g. "I've serialized up to upper; please don't merge across that, but feel free to merge among batches after it"). Does that sound believable to you?
Would this also ensure that since ≤ upper? I.e. prevent merging and timestamp advancement across that boundary?
from differential-dataflow.
> I may need a clarification: with the current (and proposed) interface, merging two batches a and b may result in a merged batch c with c.since > a.since and c.since > b.since?
I think that when you merge two batches, you need to advance the since bound so as not to make any promises that the merged data cannot provide. What probably happens at the moment (and the code tests for) is that one of a.since and b.since should be in advance of the other (because the trace's frontier used for them only advances), and we pick the more advanced one. Otherwise, we would need to pick a frontier in advance of both of them.
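The rule described above can be sketched as follows. This is only an illustration under simplifying assumptions: timestamps are pairs in a product order, and each since bound is a single timestamp rather than a full frontier (an antichain, in general). `Time`, `less_equal`, and `merged_since` are hypothetical names, not the library's API.

```rust
/// A timestamp in a two-dimensional product order (hypothetical stand-in).
pub type Time = (u64, u64);

/// `a` is less than or equal to `b` when it is so in both coordinates.
pub fn less_equal(a: Time, b: Time) -> bool {
    a.0 <= b.0 && a.1 <= b.1
}

/// The since bound for a batch merged from batches with since bounds `a` and `b`.
pub fn merged_since(a: Time, b: Time) -> Time {
    if less_equal(a, b) {
        // `b` is in advance of `a`: keep the more advanced bound.
        b
    } else if less_equal(b, a) {
        // `a` is in advance of `b`.
        a
    } else {
        // Incomparable bounds: take the least time in advance of both,
        // which in a product order is the coordinate-wise maximum.
        (a.0.max(b.0), a.1.max(b.1))
    }
}
```

In the comparable cases the result equals one of the inputs, so the merged batch's since is never behind either source batch; only in the incomparable case is it strictly in advance of both.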
from differential-dataflow.
> Would this also ensure that since ≤ upper? I.e. prevent merging and timestamp advancement across that boundary?
No, that was not the intent. If you want to prevent advancement, you can indicate that with a trace handle by not advancing its advance_by (gah, the names...).
Perhaps I should try and clear up an unspoken desideratum, which is that holding these capabilities shouldn't cripple the other users of the trace by preventing e.g. merging, but perhaps also advancement. Most of the requirements have been historical ("I need access to the history, with some properties"), whereas the durability work seems like it might be a future constraint ("I will need history from here held separate and not advanced").
The poster-child for bad behavior at the moment is in the server project, where the trace handle for the random graph stream downgrades none of its capabilities, preventing merging and advancement. I can understand not downgrading advancement, if you want full historical detail, but not downgrading merging seems like just a performance loss there. On the other hand, we do want to be able to hook the batches for durability somehow.
from differential-dataflow.
> whereas the durability work seems like it might be a future constraint. "I will need history from here held separate and not advanced."
Sounds right to me, right now, but there's a chance this can be relaxed.
> I can understand not downgrading advancement, if you want full historical detail, but not downgrading merging seems like just a performance loss there.
I see, makes sense.
from differential-dataflow.