When leadership is lost while applying a WalFrames command with commit=1, the Methods

Implement two phase commit about dqlite HOT 6 OPEN

canonical commented on May 24, 2024

Implement two phase commit

from dqlite.

Comments (6)

paulstuart commented on May 24, 2024

@freeekanayaka, this issue is still a problem (I would like to help resolve it if I can). It's rather easy to recreate:

1. Create the test table: create table simple (id integer primary key, other integer)
2. Start a long-running loop that inserts the last known id value, e.g. insert into simple (other) values(?), where other is derived from results.LastInsertId(). The key point is that id and other should always be equal.
3. Start a second loop that iterates over the node numbers and transfers leadership to the next node.

Errors will occur in the primary loop, and on occasion there will be a mismatch between id and other. That is the bug in question.
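The client side of the procedure above can be sketched roughly as follows. This is a single-node illustration that uses Python's standard sqlite3 module in place of a dqlite cluster, so it performs no leadership transfers and will not reproduce the bug on its own. The helper names insert_loop and find_mismatches are made up for this sketch, and it assumes that "other" is set to the last inserted id plus one so that id and other are expected to match.

```python
import sqlite3

def insert_loop(conn, iterations):
    """Insert rows whose `other` value is the id each row is expected to get."""
    errors = 0
    last_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM simple").fetchone()[0]
    for _ in range(iterations):
        try:
            cur = conn.execute(
                "INSERT INTO simple (other) VALUES (?)", (last_id + 1,)
            )
            conn.commit()
            last_id = cur.lastrowid
        except sqlite3.Error:
            # Against a real cluster, errors are expected while leadership moves.
            conn.rollback()
            errors += 1
    return errors

def find_mismatches(conn):
    """Return every row that violates the id == other invariant."""
    return conn.execute(
        "SELECT id, other FROM simple WHERE id != other ORDER BY id"
    ).fetchall()

# Stand-in for a dqlite connection (assumption: plain in-memory SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE simple (id INTEGER PRIMARY KEY, other INTEGER)")
errors = insert_loop(conn, 100)
mismatches = find_mismatches(conn)
print(errors, len(mismatches))
```

In a real reproduction the second loop (leadership transfers) would run concurrently, and any row returned by find_mismatches would be an instance of the bug.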

paulstuart commented on May 24, 2024

In framesAbortBecauseLeadershipLost (replication.c), there's an if/else statement based on is_commit, but the handling is exactly the same for both cases.

freeekanayaka commented on May 24, 2024

Interesting breakdown. As a first step, I'd suggest putting in place a unit test, or at least a program, that implements the procedure you outline and fails as you mention. With that at hand it should be easier to investigate the issue further, come up with a design for the solution, implement it, and prove that it works (the unit test no longer fails).

The term "two phase commit" is probably inappropriate, as Raft is by itself two-phase (a quorum is needed).

I suspect the issue here has more to do with client and server behavior when leadership is lost. The Raft paper describes roughly what should happen: an operation ID should be maintained for each client request. If a request (such as committing the transaction performing the INSERT) fails, the client should retry it, presenting the same operation ID to the new leader. In turn, the new leader should either perform the request or treat it as a no-op if it turns out that the request was actually performed but the client never received the confirmation, because the leader it initially submitted the request to had died and could not notify the client.
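The operation-ID scheme described above can be sketched as a toy model. Nothing here is dqlite's or Raft's actual API: ReplicatedState, Leader, and insert_with_retry are hypothetical names, and the sketch assumes that the table of already-applied operation IDs is replicated together with the data, so a new leader can see it.

```python
import uuid

class ReplicatedState:
    """Committed state visible to every node (hypothetical model)."""
    def __init__(self):
        self.rows = []
        self.results = {}  # operation ID -> result already produced for it

class Leader:
    def __init__(self, state, fail_after_commit=False):
        self.state = state
        self.fail_after_commit = fail_after_commit

    def insert(self, op_id, value):
        # A request that was already applied becomes a no-op that
        # returns the original result.
        if op_id in self.state.results:
            return self.state.results[op_id]
        self.state.rows.append(value)
        row_id = len(self.state.rows)
        self.state.results[op_id] = row_id
        if self.fail_after_commit:
            # The write is committed, but the client never hears back.
            raise ConnectionError("leadership lost before replying")
        return row_id

def insert_with_retry(leaders, value):
    """Retry the same request, with the same operation ID, on each leader."""
    op_id = str(uuid.uuid4())
    for leader in leaders:
        try:
            return leader.insert(op_id, value)
        except ConnectionError:
            continue
    raise RuntimeError("no leader accepted the request")

state = ReplicatedState()
old_leader = Leader(state, fail_after_commit=True)
new_leader = Leader(state)
row_id = insert_with_retry([old_leader, new_leader], 42)
print(row_id, state.rows)
```

Here the first leader commits the insert and then fails before replying. The retry against the second leader carries the same operation ID, so the row is not inserted a second time and the client still learns the id of the row it created.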

freeekanayaka commented on May 24, 2024

So far I've deferred addressing this issue since I suspect it requires a fair amount of thinking and work. However, I still intend to nail it when I have some time.

paulstuart commented on May 24, 2024

I'd like to do anything I can to lessen the load for you, as this is important to my project. Your original notes appear to be out of date, so any further brain dumps would be welcomed.

Testing this issue is a pain because it requires running a cluster under load and simultaneously hammering it with repeated transfers (or server restarts) until the magic moment occurs.

One thought was to add a "sleep" function to sqlite statements to create a long-running transaction, which would make it easier to test such actions mid-transaction. If you think that would be valuable, I'd be happy to get that going.
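As a rough illustration of the "sleep" idea, the sketch below registers a user-defined SQL function with Python's standard sqlite3 module, so that a single statement takes a controllable amount of time. It runs against plain in-memory SQLite, not dqlite; on the dqlite side the equivalent would presumably go through SQLite's C function-registration interface, and how that would be wired in is not something this thread establishes. The name sql_sleep is made up for the example.

```python
import sqlite3
import time

def sql_sleep(seconds):
    """Block for `seconds`, then return 0 so it can be folded into an expression."""
    time.sleep(seconds)
    return 0

conn = sqlite3.connect(":memory:")
# Expose sql_sleep to SQL as a one-argument function named sleep().
conn.create_function("sleep", 1, sql_sleep)
conn.execute("CREATE TABLE simple (id INTEGER PRIMARY KEY, other INTEGER)")

start = time.monotonic()
# The statement stays in flight for about 50 ms before the row is written.
conn.execute("INSERT INTO simple (other) VALUES (1 + sleep(0.05))")
elapsed = time.monotonic() - start
conn.commit()

other = conn.execute("SELECT other FROM simple").fetchone()[0]
print(other, elapsed >= 0.04)
```

With a function like this, a test could start a slow INSERT, trigger a leadership transfer while the statement is still in flight, and then check the table contents.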

freeekanayaka commented on May 24, 2024

Yes, those notes are out of date. The brain dump is basically what I wrote (assuming the issue is what I think it is), although that's admittedly hand-waving.

As I said, coming up with a program that, if run long enough, eventually reproduces the error would probably be a very good start.
