Code Monkey home page Code Monkey logo

Comments (2)

snaury avatar snaury commented on July 18, 2024

I think I have a way to make atomic time that doesn't need any complicated trees and has update and read operations as O(1). The key aspect is that mediator time advances for all tablets, except those that are currently frozen at Step-1 because they have unacknowledged transactions at Step. And we need a way to freeze and unfreeze mediator time for a particular tablet, while updating all non-frozen tablets.

This can be achieved using two atomic counters:

  • Common is shared by all tablets and updated every time we need to advance mediator time for all non-frozen tablets
  • Local is set to Step-1 when tablet has unacknowledged transactions at Step, and Max<ui64>() otherwise

Operations:

  • Advance mediator time to Step: Common.store(Step, std::memory_order_release)
  • Freeze tablets at Step: Local.store(Step-1, std::memory_order_relaxed) for each tablet with the first unacknowledged transaction at Step, then advance time to Step
  • Unfreeze tablets: Local.store(Max<ui64>(), std::memory_order_relaxed) for each unfrozen tablet
  • Read current mediator time: ui64 common = Common.load(std::memory_order_acquire); ui64 local = Local.load(std::memory_order_relaxed); return Min(common, local);

Why this works:

  • Operation order between different atomic variables is not fixed (and may be reordered, just like any memory accesses), we can only rely on release/acquire order of a particular atomic variable and how it affects other reads.
  • The logical update order between Common and Local is synchronized to a single actor thread at any given time
  • When we perform a release-store to Common we release all previous memory stores, which will be acquired by an load-acquire from Common
  • Let's suppose a load-acquire from Common observes some value Step, the next load from Local is guaranteed to have at least the last value stored to Local before Step was stored to Common, but it may (or may not) also be some future value.
  • If the tablet was frozen at Step and not yet unfrozen, that value is guaranteed to be Step-1 and the minimum will stay at Step-1 even when Common is further updated to Step+∂. Thus a tablet that was frozen before the time is updated to Step is guaranteed to stay at Step-1 and stay unaffected by further updates to Common until it is unfrozen.
  • If a tablet was unfrozen at Step (or later) and then optionally frozen at Step+∂ we may observe Local to be either Max<ui64>() or Step+∂-1, but since we read Common to be Step the result will not exceed Step
  • Let's supposed we unfreeze a tablet and then update Common to Step+∂, is it guaranteed for this tablet's time to update to Step+∂ as well? Since we set Local to Max<ui64>() before updating Common we may read some intermediate value from Common between Step and Step+∂ and also observe Local to be either Step-1 or Max<ui64>(). Depending on the observed Local value we may also observe some intermediate time. Eventually tablet will read Commonto beStep+∂and afterwards it is guaranteed to observeLocalto beMax()and notStep-1`

All synchronization is done via Common, and we get a reliable way to freeze/unfreeze time for any particular tablet, while continuing to update time for idle tablets. Relaxed load/store to Local is enough (and proving correctness must be done using the weakest order possible), in practice it's better to also use acquire/release, because those are better at pulling the latest value from caches in weak architectures (especially under simulators like relacy). This is mostly important when unfreezing tablets, when one core is reading current time in a loop it may take a long time to finally observe the change from Step-1 to Step. Note however that this is true for any atomic read, observing the latest value is never guaranteed unless it's a failed cas.

from ydb.

snaury avatar snaury commented on July 18, 2024

After making an actual sketch of the streaming protocol I think I found some problems with the idea:

  1. Tablets don't initially know their list of coordinators/mediators. This is not a major issue, but it may require making a lot more changes than anticipated. For example, currently datashards register in the timecast service after the initial distributed transaction is complete, but in the future, with per-node mediators, they wouldn't even receive plan steps until they register. Currently datashards, columnshards, persqueue and schemeshard tablets participate in distributed transactions, and we would have to carefully review code and ensure tablets register very early, as soon as they learn their processing params. I'm not sure whether it's easy, especially with schemeshard, which may simultaneously participate in multiple coordinator sets due to legacy subdomains (which still exist in production).

  2. Possibly stuck transactions that don't have a destination. Imagine a tablet that receives a plan for a drop and self-deletes because coordinator could process an ack. Now this tablet has no subscribers, nobody is waiting for its transactions, but they are still in the queue. Currently mediators will eventually notice that pipe cannot connect and check whether the tablet is deleted, automatically acknowledging transactions that don't have a recipient. Coordinators will not use pipes and hence will not be able to auto ack transactions in the same way (or they would have to after some timeout?). This complicates things a lot.

I think we need a simpler solution to the timecast problem for now, i.e. support for timecast exclusion lists.

from ydb.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.