Right now, dqlite runs sqlite3_step and other database operations on the libuv main th

Moving sqlite3_step and other database operations off the main thread

cole-miller commented on July 20, 2024

Moving sqlite3_step and other database operations off the main thread

Comments (8)

cole-miller commented on July 20, 2024

I've been looking into how to implement this. My initial plan was to have dqlite spawn a dedicated thread on startup, communicating with the main thread via a queue, and have the main thread offload all sqlite3 calls that touch a database to this thread. This would preserve a current property of the codebase: all such database operations happen on a single thread (right now, the main thread). This would make it sound to call sqlite3_config(SQLITE_CONFIG_SINGLETHREAD) in the dqlite process before startup, possibly improving performance.
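A minimal sketch of the dedicated-thread-plus-queue idea, using plain pthreads. Everything here (struct job, struct db_worker, fake_step, db_worker_demo) is a hypothetical name for illustration and not part of dqlite; fake_step stands in for a real database call such as sqlite3_step, and notifying the main loop of completion (for example with libuv's uv_async_send) is left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical work item: a callback plus its argument. */
struct job {
    void (*fn)(void *arg);
    void *arg;
    struct job *next;
};

/* Hypothetical queue consumed by the single database thread. */
struct db_worker {
    pthread_t thread;
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    struct job *head;
    struct job *tail;
    int stopping;
};

static void *db_worker_main(void *data)
{
    struct db_worker *w = data;
    pthread_mutex_lock(&w->mutex);
    for (;;) {
        while (w->head == NULL && !w->stopping) {
            pthread_cond_wait(&w->cond, &w->mutex);
        }
        if (w->head == NULL) {
            break; /* asked to stop and the queue is drained */
        }
        struct job *j = w->head;
        w->head = j->next;
        if (w->head == NULL) {
            w->tail = NULL;
        }
        /* Run the job without the lock so the main thread can keep queueing. */
        pthread_mutex_unlock(&w->mutex);
        j->fn(j->arg);
        pthread_mutex_lock(&w->mutex);
    }
    pthread_mutex_unlock(&w->mutex);
    return NULL;
}

static int db_worker_start(struct db_worker *w)
{
    w->head = w->tail = NULL;
    w->stopping = 0;
    pthread_mutex_init(&w->mutex, NULL);
    pthread_cond_init(&w->cond, NULL);
    return pthread_create(&w->thread, NULL, db_worker_main, w);
}

/* Called from the main thread. The caller keeps the job alive until it ran. */
static void db_worker_submit(struct db_worker *w, struct job *j)
{
    j->next = NULL;
    pthread_mutex_lock(&w->mutex);
    if (w->tail != NULL) {
        w->tail->next = j;
    } else {
        w->head = j;
    }
    w->tail = j;
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->mutex);
}

/* Let the worker drain the remaining jobs, then join it. */
static void db_worker_stop(struct db_worker *w)
{
    pthread_mutex_lock(&w->mutex);
    w->stopping = 1;
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->mutex);
    pthread_join(w->thread, NULL);
    pthread_cond_destroy(&w->cond);
    pthread_mutex_destroy(&w->mutex);
}

/* Stand-in for a database operation; runs only on the worker thread. */
static void fake_step(void *arg)
{
    int *counter = arg;
    (*counter)++;
}

/* Queue three jobs, stop the worker, and report how many jobs ran. */
int db_worker_demo(void)
{
    struct db_worker w;
    struct job jobs[3];
    int counter = 0;
    int rv = db_worker_start(&w);
    assert(rv == 0);
    (void)rv;
    for (int i = 0; i < 3; i++) {
        jobs[i].fn = fake_step;
        jobs[i].arg = &counter;
        db_worker_submit(&w, &jobs[i]);
    }
    db_worker_stop(&w); /* the join makes counter safe to read here */
    return counter;
}
```

Because every job runs on the same worker thread, the "all database operations happen on one thread" property is kept; only the queue itself needs a lock.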

This strategy has a problem: both sqlite3_step and our FSM implementation need to touch the VFS data structures, and the synchronous character of the raft_fsm interface forces us to perform the FSM-related VFS operations on the main thread. So if we want to run sqlite3_step on a dedicated thread, we need to make sure the VFS implementation is prepared for concurrent accesses, which it currently is not. I would like to fix that, but first I need to understand to what extent SQLite synchronizes its own calls to VFS functions. That is, imagining for the moment that we never touch the VFS data structures directly, does SQLite take care of serializing its own calls to xWrite, xShmMap, etc., or do we have to use our own locks that we acquire in the implementation of those methods? @freeekanayaka, can you shed any light here?
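For illustration only, "our own locks" could look roughly like the sketch below: a toy in-memory file whose write method takes a mutex, so that a write issued from a worker thread and one issued from the main thread (as the FSM would) cannot interleave. The names (struct toy_file, toy_file_write, toy_file_demo) are hypothetical, and toy_file_write is only loosely shaped like an xWrite method; it is not dqlite's VFS.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define TOY_FILE_CAP 4096
#define TOY_PAGE_SIZE 512

/* Hypothetical in-memory file guarded by its own mutex. */
struct toy_file {
    pthread_mutex_t mutex;
    unsigned char data[TOY_FILE_CAP];
    size_t size;
};

/* Copy `amount` bytes to `offset` while holding the file lock. */
static int toy_file_write(struct toy_file *f, const void *buf, size_t amount,
                          size_t offset)
{
    int rv = -1;
    pthread_mutex_lock(&f->mutex);
    if (offset <= TOY_FILE_CAP && amount <= TOY_FILE_CAP - offset) {
        memcpy(f->data + offset, buf, amount);
        if (offset + amount > f->size) {
            f->size = offset + amount;
        }
        rv = 0;
    }
    pthread_mutex_unlock(&f->mutex);
    return rv;
}

/* Read the current size under the same lock. */
static size_t toy_file_size(struct toy_file *f)
{
    pthread_mutex_lock(&f->mutex);
    size_t size = f->size;
    pthread_mutex_unlock(&f->mutex);
    return size;
}

struct writer_arg {
    struct toy_file *file;
    size_t offset;
    unsigned char fill;
};

/* Thread body: write one page filled with a marker byte. */
static void *writer_main(void *data)
{
    struct writer_arg *a = data;
    unsigned char page[TOY_PAGE_SIZE];
    memset(page, a->fill, sizeof page);
    toy_file_write(a->file, page, sizeof page, a->offset);
    return NULL;
}

/* Two threads write adjacent pages; returns the final file size. */
size_t toy_file_demo(void)
{
    static struct toy_file f; /* static: zero-initialized */
    pthread_t threads[2];
    struct writer_arg args[2] = {
        {&f, 0, 0xAA},
        {&f, TOY_PAGE_SIZE, 0xBB},
    };
    f.size = 0;
    pthread_mutex_init(&f.mutex, NULL);
    for (int i = 0; i < 2; i++) {
        int rv = pthread_create(&threads[i], NULL, writer_main, &args[i]);
        assert(rv == 0);
        (void)rv;
    }
    for (int i = 0; i < 2; i++) {
        pthread_join(threads[i], NULL);
    }
    size_t size = toy_file_size(&f);
    assert(f.data[0] == 0xAA && f.data[TOY_PAGE_SIZE] == 0xBB);
    pthread_mutex_destroy(&f.mutex);
    return size;
}
```

A single coarse lock per file is the simplest choice; whether it would be enough for a VFS that also tracks WAL and shared-memory state is exactly the open question above.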

(Instead of using a dedicated thread, we could run sqlite3_step on the libuv thread pool using uv_queue_work, but this still requires making the VFS thread-safe.)

freeekanayaka commented on July 20, 2024

Ok, everything I'm going to say starts from the assumption that this issue is caused by:

1) disk-mode being turned on in microk8s
2) kine implementation being inefficient at emulating etcd in terms of SQL

If either 1) or 2) is wrong, please let me know; in that case what I'm going to say can probably be discarded, and reading it would just be a waste of time :)

So, it feels like the concerns that we raised when discussing disk-mode are actually materializing. I'd encourage everybody to re-read #368 carefully, because I believe it mentions pretty much everything, from the issues that could arise with the "cheap" approach to disk-mode that has been put in place (i.e. simply storing the database on disk and using blocking I/O to access it) to the reasons why just moving sqlite3_step (or any other part of the call stack) to a thread won't work per se.

If you'll bear with me while I take a step back again and talk from a high-level architectural/design point of view, I'll just reiterate what I've said since the beginning of the microk8s/kine/dqlite idea years ago:

1) SQL is intrinsically not a great way to emulate a watchable and linearizable key/value store like etcd.
2) kine's implementation of such emulation is probably very inefficient in its own way and might need a more careful and clever approach to work better (for example, I suspect there is a way to avoid storing so many gigabytes of data in SQL tables and just store some metadata).
3) dqlite should be used for different kinds of workloads, because the choice of it being in-memory is dictated by precise technical circumstances beyond our control (in short, the fact that SQLite's VFS interface is synchronous, see below).

I know I've been beating a dead horse for years now, and apologies for the directness, but unfortunately I feel things have been swept under the carpet and duct-taped all along, and eventually reality comes knocking at the door. I'm well aware that the developers involved have little choice, since this all comes from management. But management probably never fully realized the technical ramifications of this endeavor.

Ok, enough ranting and big picture, sorry about that.

Regarding @cole-miller's question: yes, SQLite can absolutely "take care of serializing its own calls to xWrite, xShmMap, etc". To put it more precisely, SQLite supports several modes of operation with different thread-safety guarantees (see the docs), and in order to provide them it will use locks. So SQLite and threads can be friends, no problem at all.

The issue, though, is the synchronous/asynchronous mismatch that you allude to. SQLite is synchronous in every aspect, in particular the VFS, while dqlite/raft are asynchronous, basically because the asynchronous approach is more efficient and (in some ways) simpler when dealing with the network, something SQLite does not have to do.

To deal with this mismatch we basically do something a bit tricky and unconventional, along these lines:

1) There are one or more sqlite3_step calls that begin and end a write transaction.
2) SQLite's VFS interface is a low-level one: it has no concept of WAL, transactions or anything like that, just reading/writing from disk and acquiring/releasing locks. However, a clever VFS implementation that knows a lot about the WAL format and the internals of SQLite can detect that certain VFS calls mark the start of a write transaction, certain others add data to it, and certain others mark the end of the transaction.
3) Our VFS implementation does that. In particular, when a certain sqlite3_step triggers the end of a write transaction (e.g. because it executes a COMMIT statement), our VFS will behave as if everything was fine and the desired data were committed to disk. That means that the sqlite3_step call will return and believe that the changes to the WAL were persisted.
4) However, internally, our VFS implementation did not commit the transaction to the WAL. Instead it keeps the WAL data structures in a state as if the transaction was still in progress. This means that all other concurrent transactions will not see those changes yet.
5) Using the VfsPoll call, the dqlite engine obtains from our VFS implementation all the data of the transaction, and uses raft to persist it to the raft log.
6) When raft is done, the dqlite engine calls VfsApply, and only at that point does our VFS implementation change the WAL in a way that makes the transaction committed and visible to other transactions, and that makes other write transactions possible.

Basically SQLite is being tricked: it makes some synchronous VFS calls, but the final actual result happens asynchronously. I really hope I'm explaining this well enough.
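The flow above can be sketched with a toy model. This is not dqlite's VFS and does not use the real VfsPoll/VfsApply signatures; struct toy_wal and the toy_wal_* functions are hypothetical names, frames are reduced to plain integers, and the raft round trip is represented only by a comment.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_FRAMES 16

/* Hypothetical toy WAL: frames [0, n_committed) are visible to readers,
 * the next n_pending frames are written but still hidden. */
struct toy_wal {
    int frames[TOY_MAX_FRAMES];
    size_t n_committed;
    size_t n_pending;
};

/* What a VFS write does during a write transaction: stash the frame and
 * report success, without making it visible. */
static int toy_wal_write(struct toy_wal *w, int frame)
{
    if (w->n_committed + w->n_pending == TOY_MAX_FRAMES) {
        return -1; /* toy WAL is full */
    }
    w->frames[w->n_committed + w->n_pending] = frame;
    w->n_pending++;
    return 0;
}

/* What other transactions can see: committed frames only. */
static size_t toy_wal_visible(const struct toy_wal *w)
{
    return w->n_committed;
}

/* Analogue of the poll step: expose the pending frames so the engine can
 * replicate them. Visibility does not change. */
static size_t toy_wal_poll(const struct toy_wal *w, const int **frames)
{
    *frames = &w->frames[w->n_committed];
    return w->n_pending;
}

/* Analogue of the apply step: called once replication is done, this is
 * the only place where pending frames become committed. */
static void toy_wal_apply(struct toy_wal *w, size_t n)
{
    assert(n == w->n_pending);
    w->n_committed += n;
    w->n_pending = 0;
}

/* Walk through one write transaction; returns the visible frame count. */
size_t toy_wal_demo(void)
{
    struct toy_wal w = {{0}, 0, 0};
    const int *pending = NULL;

    /* sqlite3_step runs the transaction and "commits": writes succeed. */
    assert(toy_wal_write(&w, 101) == 0);
    assert(toy_wal_write(&w, 102) == 0);

    /* From SQLite's point of view the commit is done, yet nothing is
     * visible to other transactions. */
    assert(toy_wal_visible(&w) == 0);

    /* The engine polls the pending frames... */
    size_t n = toy_wal_poll(&w, &pending);
    assert(n == 2 && pending[0] == 101 && pending[1] == 102);

    /* ...would hand them to raft here, and applies them afterwards. */
    toy_wal_apply(&w, n);
    return toy_wal_visible(&w);
}
```

The point of the sketch is that visibility changes only in the apply step, so the synchronous write path never has to wait for the network.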

Note that this all works because:

1) real disk I/O is never performed by our VFS implementation. Disk I/O is only performed by raft, in an asynchronous way
2) everything happens in a single thread, so our VFS implementation can make a lot of assumptions and checks about the current state (both of SQLite and of the VFS/dqlite itself) without worrying about concurrency

Now, if you want to change 1) or 2) or both, the ramifications are quite deep.

Note that avoiding threads and context switches is what makes dqlite very fast and efficient, for example compared to other distributed SQLite implementations (see the blog post I pointed out a while back).

Introducing threads into the equation is going to degrade performance and increase complexity a lot, no matter the approach (running sqlite3_step in a thread, or just running disk I/O in a thread, or whatever).

We can discuss the details about the feasibility of that and how to achieve it, but I really hope we're all on the same page and understand it's a complex and deep change that is close to a rewrite of the core engine. That's what I was trying to say in #368, when the effort was started.

Thanks for bearing with me and reading this much. I wish this story were shorter, but I don't think it can be.

MathieuBordere commented on July 20, 2024

Thanks for the feedback, so assumption 1) is wrong at the moment, microk8s is still using vanilla in-memory dqlite.

freeekanayaka commented on July 20, 2024

> Thanks for the feedback, so assumption 1) is wrong at the moment, microk8s is still using vanilla in-memory dqlite.

Okay, then it might be sort of a kine-related issue, in the sense that the SQL-based emulation of the etcd model puts too much pressure on dqlite/SQLite. As mentioned in canonical/microk8s#3227, having a better idea of what's going on at the kine level is probably going to help. At that point one would ideally know what's so heavy in kine, and maybe be able to come up with a different design.

freeekanayaka commented on July 20, 2024

By the way, why was that microk8s user saying that he started seeing problems only from a certain version onwards? What changed in that version? Or does he have the wrong impression?
