Code Monkey home page Code Monkey logo

Comments (7)

jpoimboe avatar jpoimboe commented on August 19, 2024

@mhiramathitachi Thanks, this is a very interesting approach.

My biggest concern is that this will increase the chances of the safety check failing and returning EBUSY. In many (or even most) cases, patches will be applied by some tools automatically, and we really want to avoid having to try to tell the user that a patch failed to apply. I suspect that minimizing the chance of insmod failure is more important than avoiding stop_machine latencies.

What about the accuracy of the stack walk logic while the other CPUs are running processes? Is it safe to walk the stack of a running process that is changing the stack simultaneously?

What happens if the function never returns and the timer expires? Do we just force the counter to 0? If so, that could create complications if the function happens to return later during a future update.

Another concern is that it relies on CONFIG_KPROBES and CONFIG_KRETPROBES, which could be a problem if users want a more minimally configured kernel for some reason. Though this might be a minor point.

The fact that kretprobes aren't yet NMI safe may be a hindrance for merging something like this soon, since e.g. we'd lose support for the Fedora 20 kernel (but obviously that's just a short term problem).

from kpatch.

mhiramathitachi avatar mhiramathitachi commented on August 19, 2024

(2014/04/24 22:28), Josh Poimboeuf wrote:

My biggest concern is that this will increase the chances of the safety check failing and returning EBUSY.

I don't think so, this approach may increase the chance of applying patch, since it checks
every timing of exiting the target functions a while. Anyway, the failing chance will be
increased if we patch more functions at once. So I think incremental (separated) patch
should be discussed.

In many (or even most) cases, patches will be applied by some tools automatically, and we really want to avoid having to try to tell the user that a patch failed to apply. I suspect that minimizing the chance of insmod failure is more important than avoiding stop_machine latencies.

It depends on user demand, and I think some tools can handle safety check failure and
retry it. Also, we can support both stop_machine-less and stop_machine.

What about the accuracy of the stack walk logic while the other CPUs are running processes? Is it safe to walk the stack of a running process that is changing the stack simultaneously?

As far as taking a task_lock, the stack will be never freed and safe to walk.
Even if the process goes into the target functions, it is automatically counted in the ftrace
handler.

What happens if the function never returns and the timer expires? Do we just force the counter to 0? If so, that could create complications if the function happens to return later during a future update.

At that time, we'll remove (unregister) kretprobes, then its handler is just skipped. :)

Another concern is that it relies on CONFIG_KPROBES and CONFIG_KRETPROBES, which could be a problem if users want a more minimally configured kernel for some reason. Though this might be a minor point.

Without kretprobes, we have to use stop_machine. But I guess that such user may disable
CONFIG_FTRACE first...

The fact that kretprobes aren't yet NMI safe may be a hindrance for merging something like this soon, since e.g. we'd lose support for the Fedora 20 kernel (but obviously that's just a short term problem).

Yes, I think this will take a time. There are pros and cons on this method, and we have
a time to discuss it. :)

from kpatch.

jpoimboe avatar jpoimboe commented on August 19, 2024

Sorry, accidental close :-)

from kpatch.

jpoimboe avatar jpoimboe commented on August 19, 2024

On Sun, Apr 27, 2014 at 07:24:34AM -0700, Masami Hiramatsu (Hitachi) wrote:

(2014/04/24 22:28), Josh Poimboeuf wrote:

My biggest concern is that this will increase the chances of the safety check failing and returning EBUSY.

I don't think so, this approach may increase the chance of applying patch, since it checks
every timing of exiting the target functions a while. Anyway, the failing chance will be
increased if we patch more functions at once. So I think incremental (separated) patch
should be discussed.

Well, for example, at least functions (and their subfunctions) which
never sleep, or which run with preemption disabled, will be more likely
to fail, since they could never even be on the stack during
stop_machine.

Also it will be somewhat less deterministic, which makes testing less
reliable for users deploying to lots of systems.

Incremental patching is a useful feature, but it's inherently more
dangerous, because each patch is built against the kernel, not against
the other patches. I talked about this in #81.

In many (or even most) cases, patches will be applied by some tools automatically, and we really want to avoid having to try to tell the user that a patch failed to apply. I suspect that minimizing the chance of insmod failure is more important than avoiding stop_machine latencies.

It depends on user demand, and I think some tools can handle safety check failure and
retry it.

Yeah, it's difficult to say without hearing more from real users.

Another idea to avoid safety failures is that if we detect an in-use
function, we can set an ftrace handler for the function, wake it up, and
then retry the patching procedure. Though I'm not sure whether it would
be a good idea.

Also, we can support both stop_machine-less and stop_machine.

This is possible, but to me it seems like a lot of complexity for not
much gain. What exactly are the problems with stop_machine we're trying
to solve? Latency? -EBUSY failures?

It's hard to gauge the importance of these limitations without having
more users to complain about them :-) But as one of the few current
users of kpatch, they haven't affected me yet.

What about the accuracy of the stack walk logic while the other CPUs are running processes? Is it safe to walk the stack of a running process that is changing the stack simultaneously?

As far as taking a task_lock, the stack will be never freed and safe to walk.
Even if the process goes into the target functions, it is automatically counted in the ftrace
handler.

But even with a task_lock, the task can can change the stack while we're
attempting to read it. We could see inconsistencies or corruption.

What happens if the function never returns and the timer expires? Do we just force the counter to 0? If so, that could create complications if the function happens to return later during a future update.

At that time, we'll remove (unregister) kretprobes, then its handler is just skipped. :)

Hm, sounds racy to me. More details would help :-)

from kpatch.

mhiramathitachi avatar mhiramathitachi commented on August 19, 2024

But even with a task_lock, the task can can change the stack while we're
attempting to read it. We could see inconsistencies or corruption.

To solve this problem, I've decided to use a running task list and check those tasks in a hook of context switching. :) In this way, all sleeping tasks can be checked safely and other running tasks are listed. After that hooking the context switch and wait for all running tasks are switched. Since all running tasks are checked in its context, it should be safely done.

from kpatch.

mhiramathitachi avatar mhiramathitachi commented on August 19, 2024

Here, I've pushed patches on my tree.
https://github.com/mhiramathitachi/kpatch/tree/no-stopmachine-v1
Note that this is just a proof of concept, not for upstream. We still need some enhancements on upstream kernel (at least IPMODIFY series required). But anyway, you can see how it works :)

from kpatch.

jpoimboe avatar jpoimboe commented on August 19, 2024

I think we can close this issue. kpatch.ko can continue to use stop_machine until it retires (when livepatch closes the feature gap).

from kpatch.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.