Comments (2)
I think I have a way to make atomic time that doesn't need any complicated trees and has update and read operations as O(1). The key aspect is that mediator time advances for all tablets, except those that are currently frozen at Step-1
because they have unacknowledged transactions at Step
. And we need a way to freeze and unfreeze mediator time for a particular tablet, while updating all non-frozen tablets.
This can be achieved using two atomic counters:
Common
is shared by all tablets and updated every time we need to advance mediator time for all non-frozen tabletsLocal
is set toStep-1
when tablet has unacknowledged transactions atStep
, andMax<ui64>()
otherwise
Operations:
- Advance mediator time to
Step
:Common.store(Step, std::memory_order_release)
- Freeze tablets at
Step
:Local.store(Step-1, std::memory_order_relaxed)
for each tablet with the first unacknowledged transaction atStep
, then advance time toStep
- Unfreeze tablets:
Local.store(Max<ui64>(), std::memory_order_relaxed)
for each unfrozen tablet - Read current mediator time:
ui64 common = Common.load(std::memory_order_acquire); ui64 local = Local.load(std::memory_order_relaxed); return Min(common, local);
Why this works:
- Operation order between different atomic variables is not fixed (and may be reordered, just like any memory accesses), we can only rely on release/acquire order of a particular atomic variable and how it affects other reads.
- The logical update order between
Common
andLocal
is synchronized to a single actor thread at any given time - When we perform a release-store to
Common
we release all previous memory stores, which will be acquired by an load-acquire fromCommon
- Let's suppose a load-acquire from
Common
observes some valueStep
, the next load fromLocal
is guaranteed to have at least the last value stored toLocal
beforeStep
was stored toCommon
, but it may (or may not) also be some future value. - If the tablet was frozen at
Step
and not yet unfrozen, that value is guaranteed to beStep-1
and the minimum will stay atStep-1
even whenCommon
is further updated toStep+∂
. Thus a tablet that was frozen before the time is updated toStep
is guaranteed to stay atStep-1
and stay unaffected by further updates toCommon
until it is unfrozen. - If a tablet was unfrozen at
Step
(or later) and then optionally frozen atStep+∂
we may observeLocal
to be eitherMax<ui64>()
orStep+∂-1
, but since we readCommon
to beStep
the result will not exceedStep
- Let's supposed we unfreeze a tablet and then update
Common
toStep+∂
, is it guaranteed for this tablet's time to update toStep+∂
as well? Since we setLocal
toMax<ui64>()
before updatingCommon
we may read some intermediate value fromCommon
betweenStep
andStep+∂
and also observeLocal
to be eitherStep-1
orMax<ui64>()
. Depending on the observedLocal
value we may also observe some intermediate time. Eventually tablet will read
Commonto be
Step+∂and afterwards it is guaranteed to observe
Localto be
Max()and not
Step-1`
All synchronization is done via Common
, and we get a reliable way to freeze/unfreeze time for any particular tablet, while continuing to update time for idle tablets. Relaxed load/store to Local
is enough (and proving correctness must be done using the weakest order possible), in practice it's better to also use acquire/release, because those are better at pulling the latest value from caches in weak architectures (especially under simulators like relacy). This is mostly important when unfreezing tablets, when one core is reading current time in a loop it may take a long time to finally observe the change from Step-1
to Step
. Note however that this is true for any atomic read, observing the latest value is never guaranteed unless it's a failed cas.
from ydb.
After making an actual sketch of the streaming protocol I think I found some problems with the idea:
-
Tablets don't initially know their list of coordinators/mediators. This is not a major issue, but it may require making a lot more changes than anticipated. For example, currently datashards register in the timecast service after the initial distributed transaction is complete, but in the future, with per-node mediators, they wouldn't even receive plan steps until they register. Currently datashards, columnshards, persqueue and schemeshard tablets participate in distributed transactions, and we would have to carefully review code and ensure tablets register very early, as soon as they learn their processing params. I'm not sure whether it's easy, especially with schemeshard, which may simultaneously participate in multiple coordinator sets due to legacy subdomains (which still exist in production).
-
Possibly stuck transactions that don't have a destination. Imagine a tablet that receives a plan for a drop and self-deletes because coordinator could process an ack. Now this tablet has no subscribers, nobody is waiting for its transactions, but they are still in the queue. Currently mediators will eventually notice that pipe cannot connect and check whether the tablet is deleted, automatically acknowledging transactions that don't have a recipient. Coordinators will not use pipes and hence will not be able to auto ack transactions in the same way (or they would have to after some timeout?). This complicates things a lot.
I think we need a simpler solution to the timecast problem for now, i.e. support for timecast exclusion lists.
from ydb.
Related Issues (20)
- Report Table Stats for column tables
- preserialize `DescribeSchemeResult.PathDescription.Table.SplitBoundary`
- Проблемы детерминированных запросов Clickbench
- Bulk_upsert timed out, duration: 300 sec
- schemeshard: reject too massive operations
- Log body contents on POST request to query handler
- [QueryService] Upserting and replacing rows with RETURNING raise INTERNAL ERROR
- Display information about resources consumed by bulk_upsert
- Add checks for TPC-H/DS tests that they are performed entirely in block mode.
- [pg/pgwire] Bad type assumptions at parametrized queries.
- [pg] VariableSetStmt, search path supports only 'information_schema', 'public', 'pg_catalog', '' but got: 'local'
- [[Aardappel]]: Auth related settings
- [pg] Projection column may only use alone in order by
- [[Aardappel]]: Check compatible stable ydb versions
- Persist IndexImplTableDescription when building index
- Проблемы запуска tpcds 1 на нагрузочных кластреах
- Проблемы запуска tpcds 10 на нагрузочных кластреах
- QueryService: WINDOW with ORDER BY results differ from YQL Script HOT 1
- crash on left join
- dev: switch YQL gramar from antlr3 to antlr4
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from ydb.