Comments (1)
Currently, dqlite and raft make an effort to handle allocation failures. I would like to explore the idea of unconditionally doing pthread_exit when this happens, or at least something closer to that extreme in the space of possible error-handling strategies.
Maybe we could offer that as an option, e.g. raft_set_abort_upon_oom()? However, I'd not remove the possibility for the user to handle it gracefully, see below.
I am not militant about this, but I think it would have some real benefits and in any case it would be good to get clear on what value RAFT_NOMEM et al. are providing (if only for my benefit 🙂).
If we report an OOM failure (like any other failure) then the user has the choice of deciding what to do. If they want to abort using pthread_exit, that's easy to do. However, they might also want to handle the error gracefully, for example in an embedded device with limited memory that should ideally not go down, or maybe because some malicious client is trying to attack the server in some way.
Like all other errors, leaving the choice to the user offers the most flexibility, especially for a low-level piece like libraft.
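A minimal sketch of what "leaving the choice to the user" can look like on the caller's side. Everything here is hypothetical and only for illustration: the `demo_` names, the `DEMO_NOMEM` code (standing in for RAFT_NOMEM), and the use of `abort()` as the fail-fast action are assumptions, not libraft's API.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical status codes, standing in for RAFT_NOMEM and friends. */
enum { DEMO_OK = 0, DEMO_NOMEM = 1 };

/* Allocator hook, so the example can simulate memory exhaustion. */
typedef void *(*demo_alloc_fn)(size_t size);

/* Allocator that always fails, simulating an exhausted heap. */
static void *demo_alloc_fail(size_t size)
{
    (void)size;
    return NULL;
}

/* Hypothetical library call: it reports OOM to the caller instead of
 * terminating anything itself. */
static int demo_lib_apply(demo_alloc_fn alloc, size_t size)
{
    void *buf = alloc(size);
    if (buf == NULL) {
        return DEMO_NOMEM;
    }
    free(buf);
    return DEMO_OK;
}

/* What the application decides to do when the library reports OOM. */
enum demo_policy { DEMO_POLICY_ABORT, DEMO_POLICY_REJECT };

/* Returns 0 if the request was served and -1 if it was rejected. */
static int demo_handle_request(demo_alloc_fn alloc, enum demo_policy policy)
{
    int rv = demo_lib_apply(alloc, 64);
    if (rv == DEMO_NOMEM) {
        if (policy == DEMO_POLICY_ABORT) {
            abort(); /* Fail fast: the caller opts into termination. */
        }
        return -1; /* Degrade gracefully: drop this request, keep running. */
    }
    return 0;
}
```

Because the library only returns a code, both policies live entirely in application code: `demo_handle_request(demo_alloc_fail, DEMO_POLICY_REJECT)` returns -1 and the process keeps running, while the same call with `DEMO_POLICY_ABORT` terminates it.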
Possible advantages of trying less hard to handle allocation failure
The allocation failure handling code is a pain to test (though not impossible)
The current testing approach for OOM is basically to have parameterized tests that inject OOM failures progressively. You run the same test multiple times, but each time the memory allocator injects a failure at a different spot (e.g. in the first run the first call to malloc fails right away, in the second run the first call to malloc succeeds but the second fails, etc). That should already cover quite some ground, but we can surely improve it.
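The progressive injection described above can be sketched roughly as follows. This is not the actual raft or dqlite test harness: `fault_heap`, `fault_malloc`, `op_under_test` and `run_oom_sweep` are hypothetical names used only to show the shape of the technique.

```c
#include <stddef.h>
#include <stdlib.h>

/* Fault-injecting allocator state. */
struct fault_heap {
    int countdown; /* Allocations allowed to succeed; negative = never fail. */
};

/* Behaves like malloc() until the countdown reaches zero, then fails. */
static void *fault_malloc(struct fault_heap *h, size_t size)
{
    if (h->countdown == 0) {
        return NULL;
    }
    if (h->countdown > 0) {
        h->countdown--;
    }
    return malloc(size);
}

/* Hypothetical code under test: three allocations, with cleanup on any
 * failure. Returns 0 on success and -1 if an allocation failed. */
static int op_under_test(struct fault_heap *h)
{
    void *a = NULL;
    void *b = NULL;
    void *c = NULL;
    int rv = -1;

    a = fault_malloc(h, 16);
    if (a == NULL) goto out;
    b = fault_malloc(h, 32);
    if (b == NULL) goto out;
    c = fault_malloc(h, 64);
    if (c == NULL) goto out;
    rv = 0;
out:
    free(c); /* free(NULL) is a no-op, so partial states are fine. */
    free(b);
    free(a);
    return rv;
}

/* Runs the operation once per injection point: run i lets the first i
 * allocations succeed and fails the next one. Returns how many runs
 * reported a failure. */
static int run_oom_sweep(int runs)
{
    int failures = 0;
    for (int i = 0; i < runs; i++) {
        struct fault_heap h = { i };
        if (op_under_test(&h) != 0) {
            failures++;
        }
    }
    return failures;
}
```

In a real suite each run would also be checked for leaks and for consistent state after the failure (for example under a sanitizer), since the return code alone says little about whether cleanup was correct.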
and I wouldn't be at all surprised if there are bugs there.
Yeah, we should definitely improve this. However, in most (if not all) cases I believe the bug wouldn't be related to the type of failure (OOM), but to the fact that a failure occurs at all and is not properly handled. In other words, if we find such a bug, it's worth fixing regardless of the type of failure, because if later on we modify that particular buggy code so that it can legitimately produce other types of failures (non-OOM), then the bug is almost certainly still going to be there.
The simpler our overall error-handling strategy is, the easier it is for us to implement it consistently and without introducing subtle bugs, and the better we can describe it to users of dqlite.
Perhaps until we feel more confident about the robustness of our error handling for this particular failure, we could turn on the abort-upon-oom option (raft_set_abort_upon_oom) in the projects for which we are direct raft users, e.g. dqlite and LXD.
Possible disadvantages
I'm pretty sure no current user of dqlite depends on it gracefully handling allocation failure.
Agreed. But that might change in the future, plus dqlite is not the only consumer of libraft.
Anybody who's using go-dqlite has already signed up for unrecoverable errors when memory is exhausted. But if dqlite gets more users in the future, via the C client, that have different requirements, then "OOM -> pthread_exit" might become a problem. And it would be pretty annoying to have to backtrack from the more brutal error-handling strategy to the more graceful one.
Right. I'd leave the door open and make abort-upon-oom an opt-in option, so we can (eventually) meet the needs of both audiences.
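One way the opt-in could be shaped on the library side is sketched below. The setter is modeled on the raft_set_abort_upon_oom() idea from this thread, but it is only a proposal there; the `demo_` names, the global flag, and the single allocation choke point are all assumptions made for illustration, and `abort()` stands in for whatever termination call would actually be chosen.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Off by default: allocation failures are reported to the caller. */
static bool demo_abort_upon_oom = false;

/* Hypothetical setter, modeled on the proposed raft_set_abort_upon_oom(). */
static void demo_set_abort_upon_oom(bool enabled)
{
    demo_abort_upon_oom = enabled;
}

/* Allocator that always fails, to simulate memory exhaustion. */
static void *demo_failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}

/* Single choke point for the library's allocations. When the option is
 * on, OOM terminates here; otherwise NULL is returned and the caller is
 * expected to propagate an error code upwards. */
static void *demo_malloc(size_t size, void *(*alloc)(size_t))
{
    void *p = alloc(size);
    if (p == NULL && demo_abort_upon_oom) {
        abort();
    }
    return p;
}
```

With the flag left at its default, `demo_malloc(8, demo_failing_alloc)` returns NULL and the graceful path stays available; a project that prefers the simpler strategy calls `demo_set_abort_upon_oom(true)` once at startup.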