twosigma / cook
Fair job scheduler on Kubernetes and Mesos for batch workloads and Spark
License: Apache License 2.0
In mesos/monitor.clj, we currently report per-user metrics (waiting/running jobs, CPUs, and memory) by sending Riemann events directly. Instead, we should:
(1) Have a chime process query the database and store the results in atoms/async channels.
(2) Have a go-loop that watches the atoms/async channels and registers/deregisters gauges.
Per discussion with @dgrnbrg
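A minimal sketch of the two-piece design, in Python with assumed names (the real implementation would use chime, atoms, and a core.async go-loop in Clojure): a periodic poller snapshots per-user stats into shared state, and a separate sync step registers/deregisters gauges against that snapshot.

```python
import threading

class UserMetricsPoller:
    """Sketch of the proposed split (names assumed, not from the codebase):
    a periodic 'chime' job snapshots per-user stats from the database into
    shared state, and a watcher loop syncs gauges against that snapshot."""

    def __init__(self, query_fn):
        self.query_fn = query_fn  # e.g. queries the database for user stats
        self.latest = {}          # user -> {"running-cpus": ..., "waiting-jobs": ...}
        self.gauges = {}          # user -> gauge callback currently registered
        self._lock = threading.Lock()

    def poll_once(self):
        """Called periodically by the chime process."""
        snapshot = self.query_fn()
        with self._lock:
            self.latest = snapshot

    def sync_gauges(self):
        """Called by the watcher loop: register gauges for users that
        appeared, deregister gauges for users that vanished."""
        with self._lock:
            current = set(self.latest)
        for user in current - set(self.gauges):
            # Each gauge reads the most recent snapshot when sampled.
            self.gauges[user] = lambda u=user: self.latest.get(u, {})
        for user in set(self.gauges) - current:
            del self.gauges[user]
```

The key property is that gauge registration tracks the snapshot rather than being driven by event emission, so no events are sent directly to Riemann.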
We need to test against Kerberized Hadoop to ensure that this isn't needed -- it was added as a Kerberos support hack, but its time may have expired.
When running lein run dev-config.edn, I get:
2015-09-21 19:21:17,472:22178(0x116b06000):ZOO_ERROR@handle_socket_error_msg@1697: Socket [::1:2181] zk retcode=-4, errno=61(Connection refused): server refused to accept the client
2015-09-21 19:21:17,472:22178(0x116b06000):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:2181] zk retcode=-4, errno=61(Connection refused): server refused to accept the client
which resolves once I start running ZooKeeper locally:
2015-09-21 19:21:20,806:22178(0x116b06000):ZOO_INFO@check_events@1703: initiated connection to server [fe80::1:2181]
2015-09-21 19:21:21,191:22178(0x116b06000):ZOO_INFO@check_events@1750: session establishment complete on server [fe80::1:2181], sessionId=0x14ff236287a0000, negotiated timeout=10000
I0921 19:21:21.191696 327958528 group.cpp:313] Group process (group(1)@127.0.0.1:56667) connected to ZooKeeper
I0921 19:21:21.191776 327958528 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0921 19:21:21.191836 327958528 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
Is this expected? From the documentation:
"Cook is written in Clojure. To develop Cook, all you need is a JVM and Mesos installed and configured. Cook will automatically start embedded copies of the rest of its dependencies."
I thought I would not need any other dependencies when running in dev mode.
When Cook kills a task, the instance status should reflect that fact. However, we currently change the instance status to failed in order to kill it. I think this is not ideal: when a user sees instance status = failed in Cook, it should mean the task actually failed, i.e., Cook received task-failed/task-error (or maybe task-lost, in some cases) from Mesos.
The places we currently change instance status to failed are:
(1) To preempt a task
(2) To kill a task due to heartbeat timeout
@dgrnbrg let me know what you think
Cook should have a logo!
This can include Datomic settings, G1 GC settings, etc.
The first step here is determining the types of queries we do. This issue should be updated with the current list:
The status-related queries require secondary indices, but we could change instance IDs to be [jobid instanceid] pairs, so that we only need to implement lookup by job ID; we'd then just store a "document" with the full job & instance state.
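The [jobid instanceid] pairing can be sketched as a store keyed only by job ID (Python sketch, names assumed), where one lookup returns every instance document with no secondary status index:

```python
from collections import defaultdict

# job_id -> {instance_id -> full job & instance "document"}
store = defaultdict(dict)

def put_instance(job_id, instance_id, doc):
    """Write an instance document under its [jobid instanceid] pair."""
    store[job_id][instance_id] = doc

def instances_for_job(job_id):
    """A single lookup by job ID yields all instance documents, so
    status-related queries can scan within a job instead of needing
    a secondary index on instance status."""
    return store[job_id]
```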
Still to be analyzed:
This is necessary so that normal, non-Kerberos environments can use Cook.
Currently, you need to transact quota changes directly. These should instead be exposed via a config file or a REST API.
This is being worked on in the uris_and_ports branch. URI support is done; ports support is ongoing.
Usually libmesos.so / libmesos.dylib (Linux/Mac) are in /usr/lib or /usr/local/lib, but in my case they were in $MESOS_BUILD_DIR/src/.libs/ -- we need to understand why this was and document each edge case.
I checked out a8e1c67 and tried to run lein run dev-config.edn, but it failed because of missing dependencies. It seems the tags referenced on line 318 of components are not defined anywhere. I commented out that expression, but then I got another error about cook.reporter on line 324 and commented that expression out as well. After those changes I was able to run.
I was able to get it all running in less than 30 minutes, including pulling dependencies and debugging this. Thanks for making it easy.
This will be a showcase for the heartbeat feature in the Cook scheduler.
This will give us an idea of how long it should take to start some number of jobs, of various sizes.
The motivation is to understand how long it should take to launch a Spark cluster, so that we can figure out how multitenancy affects this, and if something special is needed.
#4 will make this 1000x easier.
After discussion with @jcoveney, we've decided to release this to Maven central.
Currently the Cook scheduler always retries a job when it fails. However, sometimes the executor can determine that a job has failed permanently, so there is no point in retrying; in that case, we should allow the executor to tell the Cook scheduler not to retry the job.
To implement this, we can leverage the data field of TaskStatus. We can start including metadata (a JSON map, maybe) along with TaskStatus; for now this would just be a "terminal-failure": "true" entry.
The Cook scheduler can then simply set the job state to completed when it sees "terminal-failure": "true" in a task status.
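The retry decision on the scheduler side could look like this sketch (Python, function name assumed), which falls back to the current always-retry behavior whenever the metadata is absent or unparseable:

```python
import json

def should_retry(task_status_data: bytes) -> bool:
    """Decide whether to retry a failed task, based on optional JSON
    metadata the executor attached to the TaskStatus data field."""
    if not task_status_data:
        return True  # no metadata: keep the current always-retry behavior
    try:
        meta = json.loads(task_status_data)
    except ValueError:
        return True  # unparseable data: fall back to retrying
    # Only an explicit terminal-failure marker suppresses the retry.
    return meta.get("terminal-failure") != "true"
```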
This is meant to test that we can submit some JSON through the REST API, see it hit Datomic, convert that to a protobuf, and then follow the whole round trip back. This could catch potentially unknown serialization/format-munging bugs, since we represent job data as Clojure data structures, Mesos protobufs, Datomic datoms, and JSON objects.
On the earliest commits, metrics v3 and v2 seem to coexist. I don't understand why that would work, but it should be possible to build this correctly using compatible versions.
For building multiple projects, this article seems useful: https://lord.io/blog/2014/travis-multiple-subdirs/
This includes using the new Cook environment variables and URI APIs, integrating the latest code into Spark 1.5, and documenting the instructions for building off Spark 1.5.
This should also add support so that the URI uses either Basic auth or Kerberos, depending on whether the URI is of the form cook://user:pass@host:port or simply cook://host:port.
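A sketch of that dispatch (Python; the convention that embedded credentials imply Basic auth and a bare host implies Kerberos is taken from the issue text above, the function name is assumed):

```python
from urllib.parse import urlparse

def auth_mode(cook_uri: str) -> str:
    """Pick the auth scheme from the URI shape: cook://user:pass@host:port
    means Basic auth, cook://host:port means Kerberos."""
    parsed = urlparse(cook_uri)
    if parsed.username is not None and parsed.password is not None:
        return "basic"
    return "kerberos"
```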
I believe these are currently always the same number. Some users might get a bigger unkillable quota, but have an equal weighting in the sharing system.
They should be able to be resolved now.
It should be really easy to patch Spark for this!
Here's an example of what the config file could look like:
:federation {:remotes ["http://localhost:12322"]
:priviledged-principal "admin"
:threads 4
:circuit-breaker {:failure-threshold 0
:lifetime-ms 60000
:response-timeout-ms 60000
:reset-timeout-ms 60000
:failure-logger-size 10000}}
The issue is that dev/prod Datomic needs the metatransaction jar, but currently metatransaction lives inside the scheduler project, so I cannot compile a standalone metatransaction jar.
Currently, the API hardcodes 32 CPUs and 200GB RAM for blocking HTTP job submission. The API should instead use the same scheduler constraints that are provided as :task-constraints.
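A sketch of what validating a submission against configured limits (rather than hardcoded ones) might look like; the constraint keys and function names here are assumptions, not the real :task-constraints schema:

```python
# Assumed shape of the configured limits; the real keys live in the
# :task-constraints section of the Cook config (names here are guesses).
TASK_CONSTRAINTS = {"max-cpus": 32, "max-mem-gb": 200}

def validate_job(cpus: float, mem_gb: float, constraints=TASK_CONSTRAINTS):
    """Reject a submission that exceeds the configured limits, instead of
    comparing against a hardcoded 32 CPUs / 200 GB."""
    if cpus > constraints["max-cpus"]:
        raise ValueError(f"requested {cpus} cpus > limit {constraints['max-cpus']}")
    if mem_gb > constraints["max-mem-gb"]:
        raise ValueError(f"requested {mem_gb} GB > limit {constraints['max-mem-gb']}")
```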
ljin@hsljin:~/ws/github/Cook/scheduler$ lein uberjar
Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 5555; nested exception is:
java.net.BindException: Address already in use
Compilation failed: Subprocess failed
Make sure that we have a sample dev config (should work out of the box) & prod config (should have comments to explain some choices).
This also should have details on the recommended production JVM options, and why to use them (Datomic using extra heap as cache, debugging GC pauses, etc).
All options should be documented in the asciidoc.
I submitted a job via Cook to a Mesos 0.23 cluster. Everything seems to have worked fine, but instances[0].status and framework_id are not getting set. On the Mesos page, I do see the job as running and the Cook scheduler as a registered framework.
[
  {
    "mem": 16,
    "max_retries": 3,
    "max_runtime": 86400000,
    "name": "cookjob",
    "command": "while [ true ]; do echo hello cook I am \"$(whoami)\" and MY_VAR=\"${MY_VAR}\"; sleep 10; done",
    "env": {
      "MY_VAR": "foo1"
    },
    "framework_id": null,
    "instances": [
      {
        "start_time": 1444169356373,
        "task_id": "cd66e79b-9272-4d54-bbd3-e89cff8c78c0",
        "hostname": "some.host.domain.com",
        "slave_id": "20151006-201511-738201772-5050-93146-S8",
        "executor_id": "cd66e79b-9272-4d54-bbd3-e89cff8c78c0",
        "status": "unknown"
      }
    ],
    "priority": 50,
    "status": "waiting",
    "uuid": "f76aa5bd-e4bb-4ef3-9ad4-5b2938efc0fd",
    "uris": null,
    "cpus": 0.5
  }
]
Besides setting the CPUs and memory for each executor, we should be able to specify additional URIs or environment variables to retrieve for the executor, and the minimum threshold of running executors to wait for before we start computing.
The reason we show in JSON will be the enum value, but all lowercase, with underscores changed to spaces and the leading reason prefix dropped.
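That transformation is straightforward; here is a sketch (Python, function name assumed; REASON_EXECUTOR_TERMINATED is a real Mesos TaskStatus reason used as the example):

```python
def humanize_reason(reason_enum: str) -> str:
    """Render a Mesos reason enum for JSON: lowercase the value, turn
    underscores into spaces, and drop the leading 'reason' word.
    e.g. REASON_EXECUTOR_TERMINATED -> "executor terminated"."""
    words = reason_enum.lower().split("_")
    if words and words[0] == "reason":
        words = words[1:]
    return " ".join(words)
```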
Discussed with @icexelloss.
Some of the scheduler tests currently seem to have a deadlock in them.
Cook's transaction functions expect to be able to call functions in Cook proper; the transactor will blow up if you don't do this, so we should document how to do so.
This will require submitting a request to here: https://github.com/travis-ci/apt-source-whitelist
Or we can download and install/unpack/build (or grab binaries) Mesos ourselves
But this also has the downside that, to get packages added to the whitelist, they seem to need source packages. And to use the cache (necessary for building the package), we'd need to be a paying Travis customer.
This is trickier than I initially thought.
Either this should be configurable, or they're no longer in use. Either way, we should figure them out!
Users would like to be able to query Cook for the list of their running and waiting jobs. We've discussed this at length internally but I'd like to bring this to the open source for design, review and implementation.
Users would like the programmatic ability to sort the queue. More generally, can we provide the ability to update any mutable property of a job via an UPDATE method?
This is a wrapper that just slightly changes the HttpClient API. HttpClient already supports every feature in this class.
This should be for things like "only on hosts w/ a specific attribute". This will enable things like GPU or machine class aware scheduling.
This will need to be added to the client-facing API, as well as to the scheduler & db.
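The core matching step could be as simple as this sketch (Python; the attribute names "gpu" and "machine-class" are illustrative, and how Mesos offers expose attributes is simplified here to a flat dict):

```python
def host_matches(offer_attributes: dict, required: dict) -> bool:
    """Sketch: accept an offer only if the host advertises every required
    attribute with the required value, enabling constraints such as
    {"gpu": "true"} or {"machine-class": "highmem"}."""
    return all(offer_attributes.get(k) == v for k, v in required.items())
```

Per-job required attributes would then travel through the client-facing API into the DB, and the scheduler would filter offers with a check like this before matching.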
@tnachen pointed https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala out to me. We could implement this for Cook to simplify submitting prod jobs to Spark via Cook.
We should have a table documenting all the properties that you can configure for the Cook spark binding.
Currently, the Datomic Free edition jars are available in public Maven repos, so lein is happy building against them. But to use Datomic Pro, one has to mvn install the licensed jars into the local Maven repo before building. The documentation on that whole process is a little sparse; I found http://aan.io/datomic-pro-and-leiningen/ useful after googling around.
The current documentation suggests that switching to Datomic Pro is as simple as s/datomic-free/datomic-pro/g in project.clj.
Things like max-runtime and priority are being set and written to the DB, rather than picking up defaults in the scheduler. This makes the schema and DB use ugly and inefficient; these should be cleaned up.