
Comments (3)

raeburn commented on August 17, 2024

I’m not really familiar with Go and its coroutines, so I’m not sure what you imagine that would look like when used inside the Linux kernel. As I understand it from what I’ve read, goroutines run efficiently because the Go runtime explicitly manages switching between them within one thread (or maybe more than one) as seen by the kernel -- essentially doing its own second level of scheduling independent of the kernel, so that context switches into the kernel can be reduced. Because of that, it also needs its own management of stack space and the like.

We looked long ago at running everything in one thread, and at least when fast storage is available and data is generated quickly, one thread rapidly becomes a bottleneck. So we went with multiple threads and message passing, with most of the worker threads owning certain data structures (address maps, allocation tables, etc.), and put a fair amount of effort into avoiding or reducing synchronization between threads -- even the need for locks, in most cases. With fast storage, VDO can consume enough CPU time to keep multiple cores busy.

Our message passing approach seems to work pretty well under heavy load, which is our main use case; when a worker thread is resumed, it’s likely to have a bunch of work waiting for it or about to be queued up by other running threads, so it can process a bunch of updates to the data structures it owns before going back to sleep. Though, when lightly loaded, it’s probably not as efficient as one might like.
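In miniature, that ownership-plus-batching scheme might look something like the following userspace pthread sketch. All the names here are invented for illustration; this is nothing like the actual kvdo code, just the shape of the idea: the table belongs to exactly one thread, so only the message queue needs a lock, and a woken worker drains everything queued before sleeping again.

```c
#include <pthread.h>
#include <stdlib.h>

struct message {
    int slot;                  /* which table entry to update */
    struct message *next;
};

struct worker {
    pthread_mutex_t lock;      /* protects the queue only, not the table */
    pthread_cond_t wake;
    struct message *head, *tail;
    int done;
    long table[16];            /* owned exclusively by this worker */
};

/* Any thread may post a message asking the owner to update a slot. */
static void worker_post(struct worker *w, int slot)
{
    struct message *m = malloc(sizeof(*m));
    m->slot = slot;
    m->next = NULL;
    pthread_mutex_lock(&w->lock);
    if (w->tail)
        w->tail->next = m;
    else
        w->head = m;
    w->tail = m;
    pthread_cond_signal(&w->wake);
    pthread_mutex_unlock(&w->lock);
}

/* The owning thread: grab everything queued at once, then drain it all
 * before sleeping again, so under load one wakeup amortizes over many
 * updates to the data structure this thread owns. */
static void *worker_run(void *arg)
{
    struct worker *w = arg;
    for (;;) {
        pthread_mutex_lock(&w->lock);
        while (!w->head && !w->done)
            pthread_cond_wait(&w->wake, &w->lock);
        struct message *batch = w->head;
        int finished = w->done;
        w->head = w->tail = NULL;
        pthread_mutex_unlock(&w->lock);
        while (batch) {
            struct message *m = batch;
            batch = m->next;
            w->table[m->slot]++;   /* no lock: this thread is sole owner */
            free(m);
        }
        if (finished)
            return NULL;
    }
}
```

Under light load this wakes the worker for nearly every message, which matches the "not as efficient as one might like" caveat; under heavy load the drain loop does most of the work per wakeup.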

I did at one point look at an approach that would keep our current message passing but, instead of creating a dedicated thread for each worker, would use a single work_struct per “worker” running under the kernel’s workqueues, letting that mechanism determine how many actual threads get created. (Sort of a three-layer work scheduling system -- kernel threads, workqueue tasks within a worker thread, and VDO tasks within a workqueue task.) It didn’t perform as well as the dedicated threads; I didn’t have time to investigate why, though I’ve got a few guesses. Creating single-threaded kernel workqueues (to enforce serialization around data structures) and giving each I/O request its own work_struct also performed poorly.
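For readers unfamiliar with the kernel API, the shape of that experiment might look roughly like this. This is a kernel-flavored sketch with invented names, not the code actually tried; the ordered workqueue stands in for the single-threaded serialization mentioned above:

```c
/* Sketch only (invented names, not actual kvdo code): each logical
 * worker keeps its message queue but is driven by a work_struct queued
 * on an ordered (effectively single-threaded) workqueue, so the
 * kernel's workqueue machinery decides when worker code actually runs. */
#include <linux/workqueue.h>

struct vdo_worker {
    struct workqueue_struct *wq;   /* ordered: serializes this worker */
    struct work_struct work;       /* one work item per logical worker */
    /* ... message queue and owned data structures ... */
};

static void vdo_worker_fn(struct work_struct *work)
{
    struct vdo_worker *w = container_of(work, struct vdo_worker, work);
    /* drain this worker's message queue, as the dedicated thread did */
    (void)w;
}

static int vdo_worker_init(struct vdo_worker *w, const char *name)
{
    /* WQ_MEM_RECLAIM: storage drivers can sit in the writeback path */
    w->wq = alloc_ordered_workqueue("%s", WQ_MEM_RECLAIM, name);
    if (!w->wq)
        return -ENOMEM;
    INIT_WORK(&w->work, vdo_worker_fn);
    return 0;
}

static void vdo_worker_poke(struct vdo_worker *w)
{
    /* after posting a message: make sure the work item is (re)queued */
    queue_work(w->wq, &w->work);
}
```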

(Also, we were asked a while back about comparing against using mutexes rather than passing the I/O requests off between threads so much. But having not written the driver to be switched between models, we’ve kind of baked the design into the code in places, including assumptions about processing messages in a certain order. So it’s not the easiest experiment to run.)

I can imagine a few scheduler tweaks that might help VDO. For example, being able to say, “I’m done, but if I haven’t used too much time, go run another thread from this group.” Or being able to express that certain pairs or groups of threads exchange messages a lot, so it might be helpful to keep them on the same core or the same NUMA node, without having to externally dictate exactly which cores each thread is allowed to use.

Having a design in which multiple threads could be waiting on each other is inherently problematic.

The threads don’t use a “send request, block waiting for result” type approach, if that’s what you’re picturing. The I/O requests are generally either moving forward through the state machine, or parked in one of various wait queues until certain events take place (usually completion of I/O to backing storage). Except for certain threads designated to handle actions that can block, none of the worker threads should block until they’ve completed their available work and moved all the pending I/O requests to their next steps.
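That “park, don’t block” pattern can be sketched as follows. The state names and structure here are illustrative inventions, not kvdo’s actual state machine; the point is that a request either moves forward or is queued on a wait list, and the worker thread itself never sleeps on a result:

```c
#include <stddef.h>

/* Invented states for illustration: a request maps its address, waits
 * for backing-store I/O, then does its final bookkeeping. */
enum req_state { REQ_MAP, REQ_WAIT_IO, REQ_UPDATE, REQ_DONE };

struct request {
    enum req_state state;
    struct request *next;      /* link while parked on a wait queue */
};

struct wait_queue {
    struct request *head, *tail;
};

static void park(struct wait_queue *q, struct request *r)
{
    r->next = NULL;
    if (q->tail)
        q->tail->next = r;
    else
        q->head = r;
    q->tail = r;
}

/* Move one request as far forward as it can go without blocking. */
static void advance(struct request *r, struct wait_queue *io_waiters)
{
    switch (r->state) {
    case REQ_MAP:
        /* needs backing-store I/O next: park it, don't block the thread */
        r->state = REQ_WAIT_IO;
        park(io_waiters, r);
        break;
    case REQ_WAIT_IO:
        break;                 /* still parked; nothing to do yet */
    case REQ_UPDATE:
        r->state = REQ_DONE;   /* final bookkeeping step */
        break;
    case REQ_DONE:
        break;
    }
}

/* Called when the backing-store I/O completes: release every parked
 * request to its next step. No thread slept while they waited. */
static void io_complete(struct wait_queue *q)
{
    struct request *r = q->head;
    q->head = q->tail = NULL;
    while (r) {
        struct request *next = r->next;
        r->state = REQ_UPDATE;
        advance(r, q);
        r = next;
    }
}
```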

Have you considered a design which may not require explicit tuning of threads?

I would like to see such a design. Of course, it’d be easy to simply remove the knobs, fix the values to compiled-in constants, and live with whatever the resulting performance is. Getting performance comparable to what we can get now with tuning would be the trick. (Ideally, something dynamically adaptable to a workload changing over time, which our current set of knobs is not.) It needs to strike a good balance between cache contention (between cores and between NUMA nodes), queueing delays, lock contention, excessive calls into the scheduler, other demands in the system for CPU cycles, throughput vs latency, performance vs power, etc.


akseg73 commented on August 17, 2024

@raeburn thanks for your detailed response.


akseg73 commented on August 17, 2024

ZFS also offers compression and dedup, and while it has a lot of tunables, I don’t recall there being a need for explicit tuning of threads (could be wrong).

