During early volume tests there was an issue whereby the Penciller Clerk would take th

Supervision of Clerks, Supervision in General

from leveled.

Comments (8)

martinsumner commented on July 20, 2024

With regard to the clerks, it is not clear why they are permanent actors. They don't hold any significant state beyond the current job they're running, and they require prompting to do any work.

So I think it would be worth looking at starting and stopping clerks on a job-by-job basis.
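
As a rough illustration of the job-by-job idea, the sketch below starts a short-lived process for a single job and waits for its result. The module name clerk_per_job, the function run_job/2 and the zero-arity JobFun are all hypothetical stand-ins, not anything taken from leveled.

```erlang
-module(clerk_per_job).
-export([run_job/2]).

%% Run one job in a short-lived process and wait for the outcome.
%% JobFun is a hypothetical zero-arity fun standing in for a clerk's work.
run_job(JobFun, Timeout) when is_function(JobFun, 0) ->
    Parent = self(),
    Tag = make_ref(),
    %% spawn_monitor/1 starts the process and monitors it in one step.
    {Pid, MRef} = spawn_monitor(fun() -> Parent ! {Tag, JobFun()} end),
    receive
        {Tag, Result} ->
            %% Job finished: drop the monitor and flush any pending 'DOWN'.
            erlang:demonitor(MRef, [flush]),
            {ok, Result};
        {'DOWN', MRef, process, Pid, Reason} ->
            %% Job crashed before replying.
            {error, Reason}
    after Timeout ->
        %% Give up on the job; a late reply may still reach the mailbox.
        exit(Pid, kill),
        erlang:demonitor(MRef, [flush]),
        {error, timeout}
    end.
```

The process exists only for the lifetime of the job, so there is no idle clerk to supervise between jobs.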

martinsumner commented on July 20, 2024

To better understand what to do here, I thought it would be worth looking at how other Riak backends handle supervision, and what I can learn from them.

martinsumner commented on July 20, 2024

Bitcask

Starting a bitcask app starts a supervisor which supervises two workers - a merge worker and a merge delete worker. When bitcask wants to prompt a merge, these workers are called via a locally registered name.

The actual bitcask work itself doesn't seem to depend on starting a process - there is no bitcask server. On startup no pid is generated. Instead, the files are opened and the key map is built up by a function which creates the state of the bitcask instance; a reference is created for the instance, and the state is stored against that reference. The reference is passed back to the function that called open, and is then used by that parent process when calling other functions. For example, a get request is passed the reference, and the state is fetched via the reference to fulfil the get. The work to perform the GET is managed by the process calling GET, not passed to a bitcask worker process.
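
The reference-plus-state pattern can be sketched roughly as below. This is not bitcask's code: the module ref_store, the state record and its fields are hypothetical, and keeping the state in the calling process's dictionary is an assumption made for the sketch.

```erlang
-module(ref_store).
-export([open/1, get/2, put/3, close/1]).

%% Hypothetical state; a real store would hold file handles and a key map.
-record(state, {dirname, data = #{}}).

%% No process is started: the caller gets back a plain reference.
open(Dirname) ->
    Ref = make_ref(),
    %% Assumption: state lives in the caller's process dictionary, keyed by Ref.
    erlang:put(Ref, #state{dirname = Dirname}),
    Ref.

%% The lookup runs in the calling process, not in a server process.
get(Ref, Key) ->
    #state{data = Data} = fetch_state(Ref),
    case maps:find(Key, Data) of
        {ok, Value} -> {ok, Value};
        error -> not_found
    end.

put(Ref, Key, Value) ->
    State = #state{data = Data} = fetch_state(Ref),
    erlang:put(Ref, State#state{data = Data#{Key => Value}}),
    ok.

%% Closing erases the state held against the reference.
close(Ref) ->
    erlang:erase(Ref),
    ok.

fetch_state(Ref) ->
    case erlang:get(Ref) of
        undefined -> error(badarg);
        State -> State
    end.
```

One consequence of this design is that the reference is only meaningful inside the process that called open, which fits the observation that a snapshot has to call open again for itself.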

Files that are opened by bitcask have a filestate record that contains their file descriptor. The filestate records are stored within the state record, which is in turn attached to the reference. So unlike leveled there is not a mapping of process to file. This means that for snapshots, the snapshot just calls open and opens up all the files again (i.e. a snapshot will have a new set of file handles).

So bitcask doesn't really have a supervision tree because, other than the merge workers, there are no processes to supervise.

martinsumner commented on July 20, 2024

Also with bitcask: when multiple bitcask databases are in action on the same node, it looks like this is handled by starting the bitcask application only once, and swallowing the already_started condition when it occurs (https://github.com/basho/bitcask/blob/develop/src/bitcask.erl#L1143-L1152). I assume, therefore, that on each node there will be only one merge_worker and one merge_delete_worker started by the supervisor (so multiple vnodes cannot perform merges at the same time).
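
A minimal sketch of swallowing already_started is shown below. The module and function names are hypothetical and the code is not copied from bitcask; only application:start/1 and its {error, {already_started, App}} return value are standard OTP.

```erlang
-module(app_once).
-export([ensure_started/1]).

%% Start an application, treating "already started" as success so that
%% several callers on the same node can each ask for it safely.
ensure_started(App) when is_atom(App) ->
    case application:start(App) of
        ok -> ok;
        {error, {already_started, App}} -> ok;
        {error, Reason} -> {error, Reason}
    end.
```

OTP also provides application:ensure_started/1, which has the same effect without the explicit case.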

The merge_worker appears to have a queue. When riak_kv_bitcask_backend gets a callback to merge, it checks the status of the queue (https://github.com/basho/riak_kv/blob/2.1.7/src/riak_kv_bitcask_backend.erl#L469-L483), and only checks for a required merge if the queue is empty. When an actual merge is requested, a worker pid is spawned to perform the merge, and any other work is queued if a worker pid has already been spawned for some outstanding work.
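
The queueing behaviour described above could look something like the following sketch: a locally registered gen_server that runs one job at a time in a spawned process and queues the rest. The module merge_queue, its API and the zero-arity job funs are hypothetical; this is an illustration of the pattern, not bitcask's merge worker.

```erlang
-module(merge_queue).
-behaviour(gen_server).
-export([start_link/0, status/0, enqueue/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-record(state, {queue = queue:new(), worker = undefined}).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Returns {QueueLength, WorkerPid | undefined}.
status() ->
    gen_server:call(?MODULE, status).

enqueue(JobFun) when is_function(JobFun, 0) ->
    gen_server:call(?MODULE, {enqueue, JobFun}).

init([]) ->
    {ok, #state{}}.

handle_call(status, _From, State = #state{queue = Q, worker = W}) ->
    {reply, {queue:len(Q), W}, State};
handle_call({enqueue, JobFun}, _From, State = #state{worker = undefined}) ->
    %% Nothing running: spawn a worker for this job straight away.
    {reply, ok, State#state{worker = start_worker(JobFun)}};
handle_call({enqueue, JobFun}, _From, State = #state{queue = Q}) ->
    %% A worker is already busy: queue the job for later.
    {reply, ok, State#state{queue = queue:in(JobFun, Q)}}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% The current worker has finished (or crashed): start the next job, if any.
handle_info({'DOWN', _MRef, process, Pid, _Reason},
            State = #state{queue = Q, worker = Pid}) ->
    case queue:out(Q) of
        {{value, Next}, Rest} ->
            {noreply, State#state{queue = Rest, worker = start_worker(Next)}};
        {empty, _} ->
            {noreply, State#state{worker = undefined}}
    end;
handle_info(_Other, State) ->
    {noreply, State}.

start_worker(JobFun) ->
    {Pid, _MRef} = spawn_monitor(JobFun),
    Pid.
```

A caller that only wants to request new work when nothing is outstanding can check that status/0 returns {0, undefined} first.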

Folds for bitcask effectively re-open bitcask from scratch in read only mode, generating a new reference: https://github.com/basho/riak_kv/blob/2.1.7/src/riak_kv_bitcask_backend.erl#L348.

Closing a bitcask just:

1 - erases the state from the reference
2 - closes a write file (if it was opened for writing)
3 - closes all the read files (note that files in snapshots have separate file handles to the actual vnode bitcask instance).

martinsumner commented on July 20, 2024

HanoiDB

HanoiDB starts a new gen_server when hanoidb:open is called (and the riak backend code for hanoidb uses open rather than the open_link alternative - https://github.com/basho-labs/riak_kv_hanoidb_backend/blob/master/src/riak_kv_hanoidb_backend.erl#L111).

When a file is being written, a new process is started to write the file. When a snapshot is started a new fold_worker is started (plain_fsm). The writer is started and linked to the worker process that started it (not via a supervisor). The fold_worker sets up a monitor of the hanoidb gen_server process that spawned it.

So HanoiDB doesn't use supervisors. It does have multiple worker processes, but relies on monitors and links between worker processes where necessary.
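
The difference between the two mechanisms can be sketched as below, with a writer linked to the process that starts it and a fold worker that monitors an owner process. The module and function names are hypothetical, and the loops are placeholders rather than anything from HanoiDB.

```erlang
-module(link_monitor_demo).
-export([start_writer/0, start_fold_worker/1]).

%% A writer linked to the caller: if either side exits abnormally, the
%% exit signal propagates to the other (unless that side traps exits).
start_writer() ->
    spawn_link(fun writer_loop/0).

writer_loop() ->
    receive
        stop -> ok;
        _Data -> writer_loop()
    end.

%% A fold worker that monitors its owner. A monitor is one-way: the
%% worker is told when the owner goes down and chooses how to react.
start_fold_worker(Owner) when is_pid(Owner) ->
    spawn(fun() ->
        MRef = erlang:monitor(process, Owner),
        fold_loop(MRef)
    end).

fold_loop(MRef) ->
    receive
        {'DOWN', MRef, process, _Pid, _Reason} ->
            %% Owner has gone: stop quietly.
            ok;
        _Other ->
            fold_loop(MRef)
    end.
```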

martinsumner commented on July 20, 2024

Leveled

Just some notes on leveled, wrt supervisors:

It isn't obvious that it would be good to have individual processes restarted independently. If one process dies, all processes die, and everything should restart again from scratch through the standard startup routine. This approach is easier to reason about and test. Leveled is designed to be a single vnode backend in a multi-node/multi-vnode Riak setup - so the temporary unavailability of one vnode due to a rippling exit and restart is not necessarily critical.
Generally, all processes receive messages only from the parent process above them in the logical supervision tree (or a clone of that parent process), so we generally don't need to restart a process because something else crashed it.
The primary concerns wrt what may crash processes are temporary issues with mount points, and corruption of files. If there are issues with mount points, they are unlikely to impact only specific processes, or to be handled more efficiently by restarting just some processes. There is already some in-built handling for corrupted files, but this seems more of a file format problem than a supervision one.

The temptation at the moment is just to replace start with start_link, and then build better handling for different file corruption scenarios into the cdb/sst code itself.

martinsumner commented on July 20, 2024

This has now switched to using start_link rather than start (except when starting clones, which are not linked).

#150
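
To illustrate the difference between the two ways of starting a process, here is a minimal sketch with a hypothetical gen_server (linked_start is not a leveled module). A server started with start_link sends an exit signal to its parent when it stops abnormally, so a failure ripples upwards; one started with start, as for the clones, does not.

```erlang
-module(linked_start).
-behaviour(gen_server).
-export([start_link/0, start_clone/0]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Linked start: an abnormal exit here is signalled to the caller.
start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Unlinked start: the caller is unaffected if this process fails.
start_clone() ->
    gen_server:start(?MODULE, [], []).

init([]) ->
    {ok, #{}}.

handle_call(ping, _From, State) ->
    {reply, pong, State}.

%% Stop with a non-normal reason to simulate a failure.
handle_cast(crash, State) ->
    {stop, simulated_failure, State}.
```

A parent that does not trap exits will itself exit when a linked child fails, which gives the "if one process dies, all processes die" behaviour without a supervisor.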

martinsumner commented on July 20, 2024

This has now been demonstrated to be a robust approach. So although this might not be the ideal setup for an Erlang project, there is no motivation to address it as technical debt.
