Comments (8)
I've been looking into how to implement this. My initial plan was to have dqlite spawn a dedicated thread on startup, communicating with the main thread via a queue, and have the main thread offload all sqlite3 calls that touch a database to this thread. This would preserve a current property of the codebase: all such database operations happen on a single thread (right now, the main thread). This would make it sound to call sqlite3_config(SQLITE_CONFIG_SINGLETHREAD) in the dqlite process before startup, possibly improving performance.
This strategy has a problem: both sqlite3_step and our FSM implementation need to touch the VFS data structures, and the synchronous character of the raft_fsm interface forces us to perform the FSM-related VFS operations on the main thread. So if we want to run sqlite3_step on a dedicated thread, we need to make sure the VFS implementation is prepared for concurrent accesses, which it currently is not. I would like to fix that, but first I need to understand to what extent SQLite synchronizes its own calls to VFS functions -- i.e., imagining for the moment that we never touch the VFS data structures directly, does SQLite take care of serializing its own calls to xWrite, xShmMap, etc., or do we have to use our own locks that we acquire in the implementation of those methods? @freeekanayaka, can you shed any light here?
(Instead of using a dedicated thread, we could run sqlite3_step on the libuv thread pool using uv_queue_work, but this still requires making the VFS thread-safe.)
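For concreteness, the dedicated-thread-plus-queue idea could look roughly like the sketch below. It uses plain pthreads, and every name in it (struct job, struct worker, worker_submit, and so on) is hypothetical rather than anything that exists in dqlite. A job stands in for one offloaded sqlite3_step call; reporting completion back to the main loop is left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical job: a function plus its argument, standing in for one
 * sqlite3_step call that the main thread wants to offload. */
struct job {
    void (*run)(void *arg);
    void *arg;
    struct job *next;
};

/* Hypothetical FIFO queue consumed by a single worker thread. */
struct worker {
    pthread_t thread;
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    struct job *head;
    struct job *tail;
    int stop;
};

static void *worker_main(void *data)
{
    struct worker *w = data;
    pthread_mutex_lock(&w->mutex);
    for (;;) {
        while (w->head == NULL && !w->stop) {
            pthread_cond_wait(&w->cond, &w->mutex);
        }
        if (w->head == NULL) {
            break; /* stop was requested and the queue is drained */
        }
        struct job *j = w->head;
        w->head = j->next;
        if (w->head == NULL) {
            w->tail = NULL;
        }
        /* Run the job without holding the lock, so the main thread can
         * keep submitting work in the meantime. */
        pthread_mutex_unlock(&w->mutex);
        j->run(j->arg);
        pthread_mutex_lock(&w->mutex);
    }
    pthread_mutex_unlock(&w->mutex);
    return NULL;
}

/* Returns 0 on success, like pthread_create. */
int worker_start(struct worker *w)
{
    w->head = NULL;
    w->tail = NULL;
    w->stop = 0;
    pthread_mutex_init(&w->mutex, NULL);
    pthread_cond_init(&w->cond, NULL);
    return pthread_create(&w->thread, NULL, worker_main, w);
}

/* Called from the main thread. The caller owns the job's memory and must
 * keep it alive until the job has run. */
void worker_submit(struct worker *w, struct job *j)
{
    assert(j != NULL && j->run != NULL);
    j->next = NULL;
    pthread_mutex_lock(&w->mutex);
    if (w->tail != NULL) {
        w->tail->next = j;
    } else {
        w->head = j;
    }
    w->tail = j;
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->mutex);
}

/* Lets the worker finish the queued jobs, then joins it. */
void worker_stop(struct worker *w)
{
    pthread_mutex_lock(&w->mutex);
    w->stop = 1;
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->mutex);
    pthread_join(w->thread, NULL);
    pthread_cond_destroy(&w->cond);
    pthread_mutex_destroy(&w->mutex);
}

/* Example job for trying the queue out: increments an int. */
void increment(void *arg)
{
    *(int *)arg += 1;
}
```

Because jobs run one at a time on a single thread, all offloaded database calls stay serialized with respect to each other. The sketch does nothing about the problem described above: the FSM would still touch the VFS from the main thread.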
from dqlite.
Ok, everything I'm going to say starts from the assumption that this issue is caused by:
1. disk-mode being turned on in microk8s
2. kine implementation being inefficient at emulating etcd in terms of SQL
If either 1) or 2) is wrong, please let me know; what I'm going to say can then probably be discarded, and reading it is just a waste of time :)
So, it feels like the concerns we raised when discussing disk-mode are actually materializing. I'd encourage everybody to re-read #368 carefully, because I believe it mentions pretty much everything, from the issues that could happen with the "cheap" approach of disk-mode that has been put in place (i.e. simply storing the database on disk and using blocking I/O to access it) to the reasons why just moving sqlite3_step (or any other part of the call stack) to a thread won't work per se.
If you'll bear with me while I take a step back again and talk from a high-level architectural/design point of view, I'll just reiterate what I've said since the beginning of the microk8s/kine/dqlite idea years ago:
- SQL is intrinsically not a great way to emulate a watchable and linearizable key/value store like etcd
- kine's implementation of such emulation is probably very inefficient in its own way and might need a more careful and clever approach to work better (for example, I suspect there is a way to avoid storing so many gigabytes of data in SQL tables and just store some metadata)
- dqlite should be used for different kinds of workloads, because the choice of it being in-memory is dictated by precise technical circumstances beyond our control (in short, the fact that SQLite's VFS interface is synchronous, see below).
I know I've been beating a dead horse for years now, and apologies for the directness, but unfortunately I feel things have been kind of swept under the carpet and duct-taped all along, and eventually reality knocks at the door. I'm well aware that all the developers involved have little choice, since this all comes from management. But management probably never fully realized the technical ramifications of this endeavor.
Ok, enough ranting and big picture, sorry about that.
Regarding @cole-miller's questions: yes, SQLite can absolutely "take care of serializing its own calls to xWrite, xShmMap, etc", or, to put it more precisely, SQLite supports several modes of operation with different thread-safety guarantees (see the docs), and in order to do that it will use locks. So SQLite and threads can be friends, no problems at all.
The issue though is the synchronous/asynchronous mismatch that you allude to. SQLite is synchronous in every aspect, in particular the VFS, while dqlite/raft are asynchronous, basically because the asynchronous approach is more efficient and simpler (in some ways) when dealing with the network, something SQLite does not have to do.
To deal with this mismatch we basically do something a bit tricky and unconventional, along these lines:
- There are one or more sqlite3_step calls that begin and end a write transaction.
- SQLite's VFS interface is a low-level one: it has no concept of WAL, transactions or anything like that, just read/write from disk and acquire/release locks. However, a clever VFS implementation that knows a lot about the WAL format and the internals of SQLite can detect that certain VFS calls mark the start of a write transaction, certain others add data to it and certain others mark the end of the transaction.
- Our VFS implementation does that, and in particular when a certain sqlite3_step triggers the end of a write transaction (e.g. because it executes a COMMIT statement) our VFS will behave as if everything was fine and the desired data were committed to disk. That means that the sqlite3_step call will return and believe that the changes to the WAL were persisted.
- However, internally, our VFS implementation did not commit the transaction to the WAL. Instead it keeps the WAL data structures in a state as if the transaction was still in progress. This means that all other concurrent transactions will not see those changes yet.
- Using the VfsPoll call, the dqlite engine obtains from our VFS implementation all the data of the transaction, and uses raft to persist it to the raft log.
- When raft is done, the dqlite engine calls VfsApply, and only at that point does our VFS implementation change the WAL in a way that makes the transaction committed and visible to other transactions, and that makes other write transactions possible.
Basically SQLite is being tricked, it makes some synchronous VFS calls, but the final actual result happens asynchronously. I really hope I'm explaining this well enough.
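To make the steps above a bit more tangible, here is a deliberately tiny model of the trick. It is not dqlite's code: the toy_* names, the fixed-size frame array and the 8-byte "pages" are all assumptions made for illustration, and the real VfsPoll/VfsApply have different signatures. The only point is the split between "SQLite was told the write succeeded" and "the frames are visible to other transactions".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_FRAMES 16
#define TOY_PAGE_SIZE 8 /* tiny pages keep the example short */

/* Hypothetical toy WAL: frames [0, n_committed) are visible to readers,
 * frames [n_committed, n_frames) belong to a write transaction that SQLite
 * believes is durable but that raft has not replicated yet. */
struct toy_wal {
    char frames[TOY_MAX_FRAMES][TOY_PAGE_SIZE];
    size_t n_frames;
    size_t n_committed;
    int write_locked; /* keeps other writers out until apply */
};

/* Stands in for the xWrite calls issued by sqlite3_step during COMMIT:
 * report success to the caller, but only buffer the frame. */
int toy_write(struct toy_wal *w, const char *page)
{
    if (w->n_frames == TOY_MAX_FRAMES) {
        return -1;
    }
    memcpy(w->frames[w->n_frames], page, TOY_PAGE_SIZE);
    w->n_frames++;
    w->write_locked = 1;
    return 0;
}

/* Number of frames a concurrent read transaction is allowed to see. */
size_t toy_visible(const struct toy_wal *w)
{
    return w->n_committed;
}

/* Rough analogue of VfsPoll: expose the pending frames so the engine can
 * hand them to raft. Returns the number of pending frames. */
size_t toy_poll(const struct toy_wal *w, const char **first)
{
    *first = w->n_frames > w->n_committed ? w->frames[w->n_committed] : NULL;
    return w->n_frames - w->n_committed;
}

/* Rough analogue of VfsApply: called once raft has persisted the frames.
 * Only now do they become visible, and other writers may proceed. */
void toy_apply(struct toy_wal *w)
{
    assert(w->n_committed <= w->n_frames);
    w->n_committed = w->n_frames;
    w->write_locked = 0;
}
```

In this model, after two toy_write calls toy_visible still returns 0 while toy_poll reports 2 pending frames; only after toy_apply does toy_visible return 2.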
Note that this all works because:
1. real disk I/O is never performed by our VFS implementation. Disk I/O is only performed by raft, in an asynchronous way
2. everything happens in a single thread, so our VFS implementation can make a lot of assumptions and checks about the current state (both of SQLite and the VFS/dqlite itself) without worrying about concurrency
Now, if you want to change 1) or 2) or both, the ramifications are quite deep.
Note that avoiding threads and context switches is what makes dqlite very fast and efficient, for example compared to other distributed SQLite implementations (see the blog post I pointed out a while back).
Introducing threads into the equation is going to degrade performance and increase complexity a lot, no matter the approach (running sqlite3_step in a thread, or just running disk I/O in a thread, or whatever).
We can discuss the details about the feasibility of that and how to achieve it, but I really hope we're all on the same page and understand it's a complex and deep change that is close to a rewrite of the core engine. That's what I was trying to say in #368, when the effort was started.
Thanks for bearing with me and reading this much. I wish this story were shorter, but I don't think it really can be.
from dqlite.
Thanks for the feedback, so assumption 1) is wrong at the moment, microk8s is still using vanilla in-memory dqlite.
from dqlite.
Thanks for the feedback, so assumption 1) is wrong at the moment, microk8s is still using vanilla in-memory dqlite.
Okay, then it might be sort of a kine-related issue, in the sense that the SQL-based emulation of the etcd model puts too much pressure on dqlite/SQLite. As mentioned in canonical/microk8s#3227, having a better idea of what's going on at the kine level is probably going to help. At that point one would ideally know what's so heavy in kine, and maybe be able to come up with a different design.
from dqlite.
By the way, why was that microk8s user saying that he started seeing problems only from a certain version onwards? What changed in that version? Or is he under the wrong impression?
from dqlite.