Proposal: stop trying to handle OOM about dqlite HOT 1 CLOSED

cole-miller commented on July 18, 2024

Proposal: stop trying to handle OOM

from dqlite.

Comments (1)

freeekanayaka commented on July 18, 2024 1

Currently, dqlite and raft make an effort to handle allocation failures. I would like to explore the idea of unconditionally doing pthread_exit when this happens, or at least something closer to that extreme in the space of possible error-handling strategies.

Maybe we could offer that as an option, e.g. raft_set_abort_upon_oom()? However, I'd not remove the possibility for the user to handle it gracefully, see below.

I am not militant about this, but I think it would have some real benefits and in any case it would be good to get clear on what value RAFT_NOMEM et al. are providing (if only for my benefit slightly_smiling_face).

If we report an OOM failure (like any other failure) then the user as the choice of deciding what to do. If they want to abort using pthread_exit, that's easy to do. However they might also want do handle the error gracefully, for example in an embedded device with limited memory that should ideally not go down, or maybe because some malicious client is trying to attack the server in some way.

Like all other errors, leaving the choice to the user offers the most flexibility, especially for a low-level piece like libraft.

Possible advantages of trying less hard to handle allocation failure

The allocation failure handling code is a pain to test (though not impossible)

The current testing approach for OOM is basically to have parameterized tests that inject OOM failures progressively. You run the same test multiple times, but each time the memory allocator injects a failure at a different spot (e.g. in the first run the first call to malloc fails right way, in the second run the first call to malloc succeeds but the second fails, etc). That should already cover quite some ground, but we can surely improve it.

and I wouldn't be at all surprised if there are bugs there.

Yeah, we should definitely improve this. However, in most (if not all) cases I believe the bug wouldn't be related to the type of failure (OOM), but to the fact that a failure occurs at all and is not properly handled. In other words if we find such a bug, it's worth being fixed independently from the type of failure, because if later on we modify that particular buggy code in ways that it can legitimately produce other types of failures (non OOM), then the bug is almost certainly going to be still there.

The simpler our overall error-handling strategy is, the easier it is for us to implement it consistently and without introducing subtle bugs, and the better we can describe it to users of dqlite.

Perhaps until we feel more confident about the robustness of our error handling for this particular failure, we could turn on the abort-upon-oom option (raft_set_abort_upon_oom) in the projects for which we are direct raft users, e.g. dqlite and LXD.

Possible disadvantages

I'm pretty sure no current user of dqlite depends on it gracefully handling allocation failure.

Agreed. But that might change in the future, plus dqlite is not the only consumer of libraft.

Anybody who's using go-dqlite has already signed up for unrecoverable errors when memory is exhausted. But if dqlite gets more users in the future, via the C client, that have different requirements, then "OOM -> pthread_exit" might become a problem. And it would be pretty annoying to have to backtrack from the more brutal error-handling strategy to the more graceful one.

Right. I'd leave the door open and make abort-upon-oom an opt-in options, so we can (eventually) meet the needs of both audiences.

from dqlite.

Proposal: stop trying to handle OOM about dqlite HOT 1 CLOSED

Comments (1)

Possible advantages of trying less hard to handle allocation failure

Possible disadvantages

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent