Code Monkey home page Code Monkey logo

grun's Introduction

* Description:

A lightweight replacement for job queueing systems like LSF, Torque, condor, SGE, for private clusters.

* Installation:

You need perl and ZMQ::LibZMQ3

    apt-get install libzmq3
    cpanm ZMQ::LibZMQ3

Put it in /usr/bin.   Type grun -h for help with the configuration/setup.  Basically you stick it on all the machines, and you put a "master" in the main config.  It is, far and away, the easiest grid job queuing system out there.

Example /etc/grun.conf on the master node:

master:    foo.mydomain.local
services:   queue
log_file:  /var/log/grun.log

Example /etc/grun.conf on a many compute nodes:

master:    foo.mydomain.local
services:   exec
log_file:  /var/log/grun.log

Once you have those two daemons running (grun -d), then you can, on either node, do this:

    grun hostname

And it will show you which node the command was run on, confirming that your installtion works.

* What you do with it:

Submit jobs, grun puts them in a queue, the queue is on disk so machines can lose connection any time, and jobs keep going.   Specify resouces requirements, or not - and have it figure stuff out.

grun will assign jobs randomly, and will tend to fill up one machine at a time, in case big jobs come along.   It should figure NFS mounts, and the right thing to do with them autoamtically.  It has a fast i/o subsystem suitable for terabyte files.  If all the machines are busy, jobs will queue up.  Resources are soft-locked by jobs until they are finished, and if a non-grun job is using a machine, it's accounted for. 

NOTE: Jobs that require changing sets of resources should call grun themselves - forcing later commands to requeue:

IE:

> grun -m 3000 myjob.sh

--- myjob.sh: ---
perl usesalotofmemory.pl
grun -c 4 -m 500 perl uses4cpusbutlessram.pl

The memory and cpu usage of the parent program will go to zero as the second script is launched.   In this way very complex jobs can be scheduled, without a special workflow system.  Just use bash, perl or ruby.

* What it does now:

It does the queueing, i/o, node-matching to requirements, suspend/resume, priorities, and has a great config system.   The "auto conf" allows sysadmins to create policies based on the parameters or script contens of a job.  Grun has run millions of jobs in a production cluster with about 52 nodes of 24-48 cores each.  Grun times every command and records memory/cpu/io.  Ulimits are set on jobs, and can be scaled to a multiple of requested (elbowing).   Jobs are editable, killable & suspend/resumable.  You can specify "make like" semantics on jobs and it will check inputs/outputs and run jobs only if needed.
 
Grun comes with a perl module that allows you to execute perl functions on remote hosts by copying the memory and code over to the remote machine. For example:

    use Grun;
    my $results = grun(\&mycrazyfunction, "param");

It decompiles the function locally, and recompiles remotely.

* What it doesn't do yet (TODO):

- needs a "plugin" architecture.  Should be easy with perl.
- graceful restart... with socket handoff (it's always safe to restart... but it can cause waiters to pause)
- harrass users when queue drive is close to full, or cpu/ram was too high to ever run, or other (configurable) issues that users have

* Goals:

Small           Keep the code under a few k lines, should be readable.   Lots of comments.
Simple          Easy configuration, guesses default values
Smart           Keeps track of stats.  Learns what to do with things.   Puts the right jobs on the right machines.
Fast            Hundreds of thousands of jobs per hour on tens of thousands of machines should be easy.
Configurable    Make up lots of new things you want it to keep track of and allocate to jobs (like disk i/o, user limits, etc.)

* Features to avoid:

No security     Grun has no security.   It has to be behind a firewall on dedicated machines.  This limits it, and keeps it simple.  It's not hard to put up an ssh tunnel, and make it work at EC2.   But I'm not building in kerberos or stuff like that.

One platform    Grun only works on unix-like machines that have perl installed.

Nothing fancy   Grun doesn't support MPI and other fancy grid things (although you can layer map-reduce on it)

* Advanced usage: 

Example of a config on an ssh tunneled EC2 node (tested), with autostaging via gwrap:

services:       exec
master:         127.0.0.1:5184
bind:           127.0.0.2:5184
wrap:           /usr/bin/gwrap
fetch_path:     /opt:/mnt/ufs

gwrap (source only) is a useful tool.  Used right, it will copy all the dependent files to the remote host, only as needed, and then, when the remote program is finished, it will copy all the output files back to the local host.   Of course, this *could* be built in to grun, via the -M option, but it isn't yet.   gwrap is not yet heavily used in production, and if you do decide to use it heavly, please let me know... I'll help add things like logging, failure recovery as needed.


grun's People

Contributors

earonesty avatar mattnashbrowns avatar tuetschek avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.