
Pls Run some jobs

Problem

Running arbitrary Linux processes on a remote machine.

Approach

Library

Individual commands are started as child processes. The handle is owned by a Job, which streams the process's logs to a file and collects its exit code. As a consequence, in the absence of a cleanup policy, chatty processes would quickly exhaust available disk space.

Jobs are grouped together and owned by a Controller, which serves as a container for jobs, deals with various aspects of spawning, and responds to queries.
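
A rough sketch of this ownership structure, assuming tokio and the uuid crate; type and field names are illustrative, not the actual implementation:

use std::collections::HashMap;
use std::path::PathBuf;
use std::sync::{Arc, RwLock};

// Possible outcome of a finished job; None in the cell below means "still running".
enum Outcome {
    ExitCode(i32),
    Signal(i32),
}

type JobId = uuid::Uuid;

// Owns the child-process handle, streams its output to a file,
// and records the outcome once, on exit.
struct Job {
    id: JobId,
    handle: tokio::process::Child,
    outcome: Arc<RwLock<Option<Outcome>>>,
    log_path: PathBuf,
}

// One Controller per client organization: a container for that client's jobs
// plus the resources (user, cgroup, base dir) provisioned for it at startup.
struct Controller {
    org: String,
    jobs: HashMap<JobId, Job>,
    cgroup_root: PathBuf,
    base_dir: PathBuf,
}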

Cgroups

The library leverages cgroups to enforce resource limits. V2 is chosen for its simpler design (a single, unified hierarchy). Since a controller can be attached to only one hierarchy and systemd mounts some controllers to v1 by default, the cgroup_no_v1=all kernel boot param is used to opt out of this behaviour and use the unified (v2) hierarchy instead. See the playbook for implementation details.

The following controls are supported (a write-side sketch follows the list):

  • cpu.weight: provide a value in [1, 10000] to set the weight of the spawned process (default 100), relative to other client processes.
  • memory.high: provide a value in bytes to soft-cap memory; a process using more than this value will be throttled.
  • memory.max: provide a value in bytes to hard-cap memory; a process using more than this value will be OOM-killed.
  • io.max.rbps & io.max.wbps: provide a value to cap I/O in bytes per second.
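
A minimal write-side sketch, assuming the unified hierarchy is mounted at /sys/fs/cgroup and the job's cgroup directory already exists; the concrete values are illustrative:

use std::fs;
use std::path::Path;

fn apply_controls(cgroup: &Path) -> std::io::Result<()> {
    // Relative CPU weight in [1, 10000], default 100.
    fs::write(cgroup.join("cpu.weight"), "500")?;
    // Throttle above 512 MiB, OOM-kill above 1 GiB.
    fs::write(cgroup.join("memory.high"), (512u64 << 20).to_string())?;
    fs::write(cgroup.join("memory.max"), (1u64 << 30).to_string())?;
    // Cap reads and writes on block device 8:0 at 1 MiB/s each
    // (io.max takes "MAJOR:MINOR rbps=N wbps=N").
    fs::write(cgroup.join("io.max"), "8:0 rbps=1048576 wbps=1048576")?;
    Ok(())
}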

Running as root

Moving a process into a cgroup requires the euid of the process performing the move to own the common ancestor of the source and destination cgroups. In other words, for a child C of process P, belonging to non-root user U, to be moved into a cgroup owned by U, P itself needs to be in a cgroup that is a direct ancestor of the target cgroup and owned by U.
This could be achieved by setting up appropriate permissions beforehand as a privileged user and executing the binary as U; a sketch of that setup follows. For this implementation it is a non-goal, and the process is run as root.
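
For reference, a sketch of what that setup could look like (not the path taken here), assuming Rust 1.73+ for std::os::unix::fs::chown; per the kernel's cgroup-v2 delegation docs, the directory plus its cgroup.procs and cgroup.subtree_control files are handed to U:

use std::os::unix::fs::chown;
use std::path::Path;

fn delegate(cgroup: &Path, uid: u32, gid: u32) -> std::io::Result<()> {
    // Hand the subtree to the unprivileged user U (uid/gid are placeholders).
    chown(cgroup, Some(uid), Some(gid))?;
    chown(cgroup.join("cgroup.procs"), Some(uid), Some(gid))?;
    chown(cgroup.join("cgroup.subtree_control"), Some(uid), Some(gid))?;
    Ok(())
}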

Flow

A start request is forwarded to the controller, which creates a child cgroup for that particular job. After the child process is fork'ed, but before it is exec'ed, it runs its pre_exec closures: the child adds its pid to the cgroup created for this particular job by calling getpid and writing the result to cgroup.procs, then calls setgid and setuid to ensure the job does not run as a privileged user.
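
A condensed sketch of that flow, assuming the libc crate; the cgroup path and uid/gid are placeholders:

use std::fs::OpenOptions;
use std::io::Write;
use std::os::unix::process::CommandExt;
use std::process::{Child, Command};

fn spawn_job() -> std::io::Result<Child> {
    let cgroup_procs = "/sys/fs/cgroup/runner/org-a/job-123/cgroup.procs";
    let (uid, gid) = (1001, 1001); // hypothetical unprivileged job user

    let mut cmd = Command::new("curl");
    cmd.arg("example.com");
    unsafe {
        cmd.pre_exec(move || {
            // Runs in the child after fork(), before exec().
            // 1. Move this process into the job's cgroup.
            let pid = std::process::id(); // getpid()
            OpenOptions::new()
                .write(true)
                .open(cgroup_procs)?
                .write_all(pid.to_string().as_bytes())?;
            // 2. Drop privileges (gid first, then uid).
            if libc::setgid(gid) != 0 || libc::setuid(uid) != 0 {
                return Err(std::io::Error::last_os_error());
            }
            Ok(())
        });
    }
    cmd.spawn()
}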

Authn/z

Authentication is implemented with mTLS. In a production scenario, a job-runner service provider would leverage their own CA to generate a chain of trust. Each client (in the business sense, as an organization) could be issued an intermediate CA, which in turn would be used to issue end-entity certificates. The combination of the Organization and emailAddress values in the subject field could be used for fine-grained control of client (and end entities associated with such a client) capabilities. For this implementation such flexibility is a non-goal; instead, authentication is implemented with a simpler scheme of a single root CA issuing end-entity certificates for client and server.

Authorization is done via a job-ownership scheme, leveraging the Organization value read from the subject field of the end-entity certificate provided by the caller.
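
A sketch of pulling that Organization (O) value out of a DER-encoded end-entity certificate with the x509-parser crate; the surrounding tonic wiring (obtaining the DER bytes from the request's peer certificates) is elided and version-dependent:

use x509_parser::prelude::*;

fn organization_from_cert(der: &[u8]) -> Option<String> {
    // Parse the DER certificate and read the first O= attribute of the subject.
    let (_, cert) = X509Certificate::from_der(der).ok()?;
    cert.subject()
        .iter_organization()
        .next()
        .and_then(|attr| attr.as_str().ok())
        .map(str::to_owned)
}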

The system uses TLS v1.3 and ECC.

Keys are generated with openssl genpkey -algorithm ed25519. The curve is used in the wild and considered safe by SafeCurves.

Certificates are generated with:

# CA
openssl genpkey -algorithm ed25519 > ca/key
openssl req -new -x509 -key ca/key -out ca/cert -config $CONF
# Server 
openssl genpkey -algorithm ed25519 > server/key
openssl req -new -key server/key -out server/csr -config $CONF
openssl x509 -req -in server/csr -extfile $CONF -extensions server_ext -out server/cert -CA ca/cert -CAkey ca/key 
# Client
openssl genpkey -algorithm ed25519 > client/key
openssl req -new -key client/key -out client/csr -config $CONF
openssl x509 -req -in client/csr -extfile $CONF -extensions client_ext -out client/cert -CA ca/cert -CAkey ca/key 
OpenSSL config:
[ req ]
default_bits        = 4096
default_md          = sha512
distinguished_name  = req_distinguished_name

[ req_distinguished_name ]
countryName                     = Country Name (2 letter code)
stateOrProvinceName             = State or Province Name
localityName                    = Locality Name
0.organizationName              = Organization Name
organizationalUnitName          = Organizational Unit Name
commonName                      = Common Name
emailAddress                    = Email Address
countryName_default = CA 
stateOrProvinceName_default = BC 
localityName_default = Vancouver

[ client_ext ]
basicConstraints        = CA:FALSE
subjectKeyIdentifier    = hash
authorityKeyIdentifier  = keyid, issuer
keyUsage                = critical, nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage        = clientAuth, emailProtection

[ server_ext ]
basicConstraints        = CA:FALSE
subjectKeyIdentifier    = hash
authorityKeyIdentifier  = keyid, issuer:always
keyUsage                = critical, digitalSignature, keyEncipherment
extendedKeyUsage        = serverAuth
subjectAltName = @alt_names

[alt_names]
DNS.1 = localhost
IP.1 = 127.0.0.1
IP.2 = ::1

The application leverages rustls for its crypto needs; rustls was audited by Cure53, on behalf of CNCF, in 2020.

gRPC

tonic is the de facto standard gRPC server/client library in the Rust ecosystem.

Protobuf definition:
syntax = "proto3";

package runner;

service JobRunner {
  rpc Start(JobRequest) returns (JobId);
  rpc Stop(JobId) returns (Ack);
  rpc Status(JobId) returns (JobStatus);
  rpc Output(JobId) returns (stream LogMessage);
}

message JobRequest {
  message CpuControl { uint32 cpu_weight = 1; }

  message MemControl {
    uint64 mem_high = 1;
    uint64 mem_max = 2;
  }

  message IoControl {
    uint32 major = 1;
    uint32 minor = 2;

    uint64 rbps_max = 3;
    uint64 wbps_max = 4;
  }

  string executable = 1;
  optional CpuControl cpu_control = 2;
  optional MemControl mem_control = 3;
  optional IoControl io_control = 4;
  repeated string args = 5;
}

message Ack {}

message JobId { bytes jobid = 1; }

message JobStatus {
  oneof outcome {
    int32 exit_code = 1;
    int32 signal = 2;
  }
}

message LogMessage {
  enum Fd {
    out = 0;
    err = 1;
  }
  Fd fd = 1;
  bytes output = 2;
}

Pics

Server startup state transition

Notes:

  • Upon parsing certificates, the server process knows the finite set of clients that will be calling it. This knowledge is leveraged to create distinct Linux users named after the Organization value in the subject field.
  • For each of those users, a base directory (for jobs to run in) and organization-level cgroups are created. Information about the created users, cgroups and dirs is stored on the Controller associated with that specific user.
  • Upon receiving Start(JobRequest), the server process picks the organization name from the supplied certificate and dispatches the request to the controller for that specific user; a sketch of the user-provisioning step follows.
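
A sketch of that provisioning step, shelling out to id/useradd; flags and error handling are condensed:

use std::process::Command;

fn ensure_user(org: &str) -> std::io::Result<()> {
    // `id <name>` exits non-zero when the user does not exist.
    let exists = Command::new("id").arg(org).output()?.status.success();
    if !exists {
        let status = Command::new("useradd")
            .args(["--system", "--no-create-home", org])
            .status()?;
        if !status.success() {
            // Fatal at startup, matching the state diagram below.
            return Err(std::io::Error::new(
                std::io::ErrorKind::Other,
                format!("useradd failed for {org}"),
            ));
        }
    }
    Ok(())
}
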
Server startup state transition mermaid
graph TD
    S[Startup]-->RC
    RC[Reading Certs]
    RC-->|Got users from certs Organization value|EU
    RC-->|cert read fail|E[Fatal error -> process.exit]
    EU[Ensuring Org-level Users]
    EU-->|Users exist|CCG[Creating user cgroups]
    EU-->|Users don't exist|UA(useradd)
    UA-->|Created|CCG
    UA-->|useradd fail|E 
    CCG-->|cgroups ok|CD[Creating user dirs]
    CCG-->|cgroups fail|E 
    CD-->|Dirs ok|L[Listen for incoming requests]
    CD-->|Dirs fail|E

Start Job

Notes:

  • Command has a pre_exec method on it. I believe that for this use case, leveraging pre_exec + spawn is semantically equivalent to using fork() and exec().
  • As mentioned earlier, Job is an abstraction over the child process which owns the handle and enables communication with other actors in the system.
Start Job mermaid
sequenceDiagram
actor CA as Client A
participant S as Job Runner Service 
participant J as Job
participant CRA as Controller for Client A
participant FS as File Storage

CA->>+S: start: <br>curl example.com <br>cpu.weight=500
S->>CRA: new Job request for Client A
CRA->>J: new()
J->>CRA: JobId
CRA->>CRA: set up cgroup controls for JobId
CRA->>J: fork()
J->>J: Write getpid() to cgroup.procs
J->>J: setgid() and setuid() to Client A user
J->>J: exec()
J->>CRA: Ok
CRA->>CRA: Store Job Handle
CRA->>S: JobId
S->>-CA: JobId
J->>FS: Stream output to file 

Status

Notes:

  • Status could be updated by the Job task using Arc<RwLock<JobOutcome>>. An alternative approach I've considered was Arc<AtomicU16>, but that would require magic numbers. Given that the value is updated once, when the job completes or is terminated, I believe RwLock is the appropriate synchronization tool; a sketch follows.
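
A sketch of that shared cell; the writer (the task awaiting the child) sets the value exactly once, while readers serve Status RPCs:

use std::sync::{Arc, RwLock};

#[derive(Clone, Copy)]
enum JobOutcome {
    ExitCode(i32),
    Signal(i32),
}

type StatusCell = Arc<RwLock<Option<JobOutcome>>>;

fn record_outcome(cell: &StatusCell, outcome: JobOutcome) {
    // Written once, when the job exits or is killed.
    *cell.write().unwrap() = Some(outcome);
}

fn query(cell: &StatusCell) -> Option<JobOutcome> {
    // None means the job is still running.
    *cell.read().unwrap()
}
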
Job status mermaid
sequenceDiagram
actor CA as Client A
participant S as Job Runner Service 
participant CRA as Controller for Client A

Note over CA,CRA: Job with id JobId for Client A was started earlier
CA->>S: Status(JobId)
S->>CRA: Status for JobId pls
alt Job still running
CRA->>S: Id exists, no status, must be running
S->>CA: running
else Job exited
CRA->>S: Id exists, has exit_code
S->>CA: exit_code
else Client kills Job
CRA->>S: Id exists, has signal
S->>CA: signal
end 

Output

Notes:

  • Considered keeping logs in-memory, but I believe this approach has a couple of drawbacks. Reads and writes need to be synchronized somehow, so our options are either introducing a Mutex, which may become fairly hot (depending on the volume of logs), or making Job own the output of the child process instead of streaming it to a file. The latter means all job output is kept in the main process's memory, regardless of whether it will ever be read.
  • Assuming the job is still running, only logs generated by the moment of the request would be sent back. To address that, the controller could leverage inotify (the intention is to use inotify-rs). When the log-streamer task reads EOF from the open file, instead of exiting it select!'s on one of two events: either job done, in which case no new logs will be generated, or an IN_MODIFY event, in which case the file is seek'ed to wherever the cursor was left and streaming to the client resumes; a sketch of that loop follows.
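
A sketch of that tailing loop under tokio; the `modified` channel is assumed to be fed by an inotify watcher (e.g. inotify-rs) on IN_MODIFY events, `job_done` fires when the child exits (both are hypothetical wiring), and the final drain after completion is elided:

use std::io::SeekFrom;
use tokio::io::{AsyncReadExt, AsyncSeekExt};
use tokio::sync::{mpsc, oneshot};

async fn tail_log(
    path: &str,
    mut modified: mpsc::Receiver<()>,
    mut job_done: oneshot::Receiver<()>,
    tx: mpsc::Sender<Vec<u8>>,
) -> std::io::Result<()> {
    let mut file = tokio::fs::File::open(path).await?;
    let mut cursor = 0u64;
    let mut buf = vec![0u8; 8192];
    loop {
        // Drain everything written since the cursor was last left.
        loop {
            file.seek(SeekFrom::Start(cursor)).await?;
            let n = file.read(&mut buf).await?;
            if n == 0 {
                break; // EOF, for now
            }
            cursor += n as u64;
            let _ = tx.send(buf[..n].to_vec()).await;
        }
        // At EOF: resume on new output, or stop once the job is done.
        tokio::select! {
            Some(()) = modified.recv() => continue,
            _ = &mut job_done => return Ok(()),
        }
    }
}
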
Job output mermaid
sequenceDiagram
actor CA as Client A
participant S as Job Runner Service 
participant CRA as Controller for Client A
participant J as Job
participant FS as File System

Note over CA,FS: Job with id JobId for Client A was started earlier
J-)FS: Streams logs
CA->>S: Output(JobId)
S->>CRA: Output for JobId pls
CRA->>FS: Open Log files for JobId
FS->>CRA: Here is the handle
CRA->>S: Stream contents back
S->>CA: Here are your logs 

Stop

Stop Job mermaid
sequenceDiagram
actor CA as Client A
participant S as Job Runner Service 
participant CRA as Controller for Client A
participant J as Job
participant FS as File System

Note over CA,FS: Job with id JobId for Client A was started earlier
J-)FS: Streams logs
CA->>S: Stop(JobId)
S->>CRA: Stop JobId pls
CRA->>J: Early out
J->>FS: Flush logs
J->>J: Kill child process
J->>CRA: Done
CRA->>CRA: Update status for JobId
CRA->>S: Done
S->>CA: Ack

Authz

Authz mermaid
sequenceDiagram
actor CA as Client A
actor CB as Client B
participant S as Job Runner Service 
participant CRA as Controller for Client A
participant CRB as Controller for Client B

Note over CA, CB: Client B gets JobId of Client A somehow
CB->>S: Output(JobId)
S->>CRB: Output for JobId pls
CRB->>S: No such Job
S->>CB: Not found

CLI

For the CLI, I'm aiming to use structopt, which is a great tool for parsing CLI arguments. The API could look somewhat like this:

<bin> start -e curl -a example.com -cpu-weight 250 -memory-max 3145728 -u localhost:6666 -f cert
// => JobId: abc-xyz
<bin> status abc-xyz -u localhost:6666 -f cert
// => exit_code 0
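
A hypothetical structopt sketch matching that surface; names and flags are illustrative, not the actual implementation:

use std::path::PathBuf;
use structopt::StructOpt;

#[derive(StructOpt)]
#[structopt(name = "runner")]
enum Cli {
    /// Start a job on the remote runner.
    Start {
        #[structopt(short = "e", long)]
        executable: String,
        #[structopt(short = "a", long)]
        args: Vec<String>,
        #[structopt(long)]
        cpu_weight: Option<u32>,
        #[structopt(long)]
        memory_max: Option<u64>,
        #[structopt(short = "u", long)]
        url: String,
        #[structopt(short = "f", long)]
        cert: PathBuf,
    },
    /// Query the status of a previously started job.
    Status {
        job_id: String,
        #[structopt(short = "u", long)]
        url: String,
        #[structopt(short = "f", long)]
        cert: PathBuf,
    },
}

fn main() {
    match Cli::from_args() {
        Cli::Start { executable, .. } => println!("start {executable}"),
        Cli::Status { job_id, .. } => println!("status {job_id}"),
    }
}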

Typing params that do not change from request to request gets very old very fast; in a production environment, a config file with a default profile in a well-known location (~/.config/.runner.json) could be used instead. A sketch follows.
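
A sketch of loading such a profile, assuming serde/serde_json and the dirs crate; the file layout is hypothetical:

use serde::Deserialize;

#[derive(Deserialize)]
struct Profile {
    url: String,
    cert: std::path::PathBuf,
}

fn load_default_profile() -> Option<Profile> {
    // Resolves to ~/.config/.runner.json on Linux.
    let path = dirs::config_dir()?.join(".runner.json");
    let bytes = std::fs::read(path).ok()?;
    serde_json::from_slice(&bytes).ok()
}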

High Availability

In its current state, the service runs on a single host. To move the job runner into the world of high availability, we would need to introduce some indirection. One way to go about it is to add a routing layer which knows which controllers run on which machines, and consequently which jobs are executed where. That way, requests of a particular client could be forwarded to a subgroup of hosts.
This introduces a whole set of interesting challenges, such as state checkpoints (in case process migrations are required), and further design decisions depend heavily on the business problem clients are trying to solve.
