tango's Introduction

Tango

Tango is a standalone RESTful Web service that runs and manages jobs. A job is a set of files that must satisfy the following constraints:

  1. There must be exactly one Makefile that runs the job.
  2. The output for the job should be printed to stdout.

Example jobs are provided for the user to peruse in clients/. Tango has a REST API which is used for job submission.

Upon receiving a job, Tango will copy all of the job's input files into a VM, run make, and copy the resulting output back to the host machine. Tango jobs are run in pre-configured VMs. Support for various Virtual Machine Management Systems (VMMSs) like KVM, Docker, or Amazon EC2 can be added by implementing a high level VMMS API that Tango provides.
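The lifecycle described above can be sketched in Python as follows. This is only an illustration, not Tango's worker code: the FakeVMMS class and the method names copy_in, run_make, copy_out, and destroy are hypothetical stand-ins for whatever the real VMMS API defines, and the cleanup step is an assumption.

```python
class FakeVMMS:
    """Hypothetical in-memory stand-in for a VMMS backend (not Tango's real API)."""

    def __init__(self):
        self.vm_files = {}      # files currently inside the pretend VM
        self.destroyed = False

    def copy_in(self, input_files):
        # Copy the job's input files into the VM.
        self.vm_files.update(input_files)

    def run_make(self):
        # A real backend would run `make` inside the VM; here we fake its stdout.
        if "Makefile" not in self.vm_files:
            raise RuntimeError("job has no Makefile")
        return "make ran with files: " + ", ".join(sorted(self.vm_files))

    def copy_out(self, output):
        # Hand the captured stdout back to the host.
        return output

    def destroy(self):
        self.destroyed = True


def run_job(vmms, input_files):
    """Copy inputs in, run make, copy the output back, and always clean up."""
    try:
        vmms.copy_in(input_files)
        output = vmms.run_make()
        return vmms.copy_out(output)
    finally:
        vmms.destroy()
```

A real backend would replace FakeVMMS with calls into KVM, Docker, or EC2; the point of the sketch is only the order of the steps.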

A brief overview of the Tango repository:

  • tango.py - Main tango server
  • jobQueue.py - Manages the job queue
  • jobManager.py - Assigns jobs to free VMs
  • worker.py - Shepherds a job through its execution
  • preallocator.py - Manages pools of VMs
  • vmms/ - VMMS library implementations
  • restful_tango/ - HTTP server layer on the main Tango

Tango was developed as a distributed grading system for Autolab at Carnegie Mellon University and has been extensively used for autograding programming assignments in CMU courses.

Using Tango

Please feel free to use Tango at your school/organization. If you run into any problems with the steps below, you can reach the core developers at [email protected] and we would be happy to help.

  1. Follow the steps to set up Tango.
  2. Read the documentation for the REST API.
  3. Read the documentation for the VMMS API.
  4. Test whether Tango is set up properly and can process jobs.

Python 2 Support

Tango now runs on Python 3. A legacy branch, master-python2, preserves a snapshot of the last Python 2 Tango commit. If you are still on the Python 2 version, you are strongly encouraged to upgrade to the current Python 3 version of Tango, as future enhancements and bug fixes will be focused on the current master.

We will not be backporting new features from master to master-python2.

Contributing to Tango

  1. Fork the Tango repository.
  2. Create a local clone of the forked repo.
  3. Install pre-commit from pip, and run pre-commit install to set up Git pre-commit linting scripts.
  4. Make a branch for your feature and start committing changes.
  5. Create a pull request (PR).
  6. Address any comments by updating the PR and wait for it to be accepted.
  7. Once your PR is accepted, a reviewer will ask you to squash the commits on your branch into one well-worded commit.
  8. Squash your commits into one and push to your branch on your forked repo.
  9. A reviewer will fetch from your repo, rebase your commit, and push to Tango.

Please see the git linear development guide for a more in-depth explanation of the version control model that we use.

License

Tango is released under the Apache License 2.0.

tango's People

Contributors

20wildmanj, akhilnadigatla, ashleyzhang, cg2v, clairecw, cool00geek, damianhxy, dependabot[bot], devanshk, dlbucci, droh, evanyeyeye, fanpu, huahan98, icanb, jlge, linnil1, loicgelle, mihirpandya, mojojojo99, nayak16, oliverli, victorhuangwq, wongsingfo, xinyis991105, ymzong, yrkumar

tango's Issues

Losing permission bits

The permission bits of files are lost as they are uploaded to Tango (by /upload). As a result, scripts that need +x cannot run. We should either find a way to preserve these bits during upload or tell the front-end not to rely on them and to run bash hello.sh in the Makefile instead of ./hello.sh.
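One possible workaround, sketched here in Python, is to restore the execute bits after the upload has landed. The helper names add_exec_bits and demo are hypothetical and nothing like them is known to exist in Tango; the sketch assumes a POSIX filesystem and that it is acceptable to mark an uploaded script executable.

```python
import os
import stat
import tempfile


def add_exec_bits(path):
    """Give execute permission to every class that already has read permission."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Shift each read bit (0o444) two places right so it lands on the
    # matching execute bit (0o111), then OR it into the existing mode.
    os.chmod(path, mode | ((mode & 0o444) >> 2))
    return stat.S_IMODE(os.stat(path).st_mode)


def demo():
    """Simulate an uploaded script that arrived as rw-r--r-- and repair it."""
    with tempfile.TemporaryDirectory() as workdir:
        script = os.path.join(workdir, "hello.sh")
        with open(script, "w") as handle:
            handle.write("#!/bin/sh\necho hello\n")
        os.chmod(script, 0o644)  # the state after a permission-losing upload
        return add_exec_bits(script)
```

Invoking the script through its interpreter (bash hello.sh) avoids the problem entirely and needs no change on the Tango side.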

Verbose ssh output

Instrument tango so that it can be configured to dump ssh verbose output to the logs. This would be a huge help in debugging ssh/scp issues.

213 labs are not autograding (high priority)

With the exception of proxylab, 213 uses legacy .rb files that overload functions like autogradeInputFiles and parseAutoresult. None of the labs are autograding. Tango returns a status code of -3 to the front-end with no other feedback.

Kosbie uses similar legacy .rb files, so he's going to be running into this issue as well.

Tango stalls on destroyVM indefinitely

If the VMs on a backend machine become wedged, Tango stalls indefinitely during the retry waiting for destroyVM to finish. Instead, Tango should time out on waitVM, create a new instance, and retry waitVM on that instance. Eventually, the wedged machine will fill up with VMs, but this will allow Tango to degrade gracefully while we are waiting to restart the backend.
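A minimal Python sketch of the proposed behavior follows. The functions wait_with_timeout and acquire_vm, and the create_vm and is_ready callbacks, are hypothetical names used for illustration; they are not Tango's waitVM or destroyVM.

```python
import time


def wait_with_timeout(is_ready, timeout_secs, poll_secs=0.01):
    """Poll is_ready() until it returns True or the deadline passes."""
    deadline = time.monotonic() + timeout_secs
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(poll_secs)
    return False


def acquire_vm(create_vm, is_ready, timeout_secs, max_tries=3):
    """Create a fresh VM on every attempt instead of blocking on a wedged one."""
    for _ in range(max_tries):
        vm = create_vm()
        if wait_with_timeout(lambda: is_ready(vm), timeout_secs):
            return vm
        # The wedged VM is abandoned rather than destroyed synchronously,
        # so a hung destroy call cannot stall the caller.
    return None
```

Abandoned instances accumulate on the backend, which matches the trade-off described above: the machine eventually fills up, but jobs keep flowing until it can be restarted.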

Tango3 jobs are taking too long

The 213 datalab is taking about 7-8 seconds to run, with up to 3-4 seconds required for the scp-based copyin step, which should be almost instantaneous. In the old system, datalab took 2-3 seconds. In the past, this kind of slow-down has been due to authentication issues that required the ssh client to retry using different authentication schemes.

It would be easy to diagnose if there were a way to tell Tango to call ssh with the verbose option (-vvvv) and then dump the ssh output to the tango log.

`elapsed_secs` field in `info` endpoint is wrong

When I hit /info endpoint in the API, the elapsed_secs variable is shown as the current epoch time (e.g. 1429372581) instead of the actual number of seconds elapsed.

Maybe you forgot to save the time when Tango was started?
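The suspected fix can be sketched in Python as below. ServerInfo is a hypothetical class, not Tango's code; it only shows the idea of recording the start time once and subtracting it later. The injectable clock parameter is an assumption made to keep the example testable.

```python
import time


class ServerInfo:
    """Remembers when the server started so elapsed time can be reported."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._start_time = clock()  # saved once, at startup

    def elapsed_secs(self):
        # Current time minus start time, not the raw epoch value.
        return self._clock() - self._start_time
```

If the start time were never saved (left at 0), the subtraction would return the current epoch time, which is the symptom reported here.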

Crash Vulnerability in Job Manager

Currently, jobs are moved from the liveJobs queue to the deadJobs queue. If the JobManager goes down after a job is removed from the liveJobs queue but before it is added to the deadJobs queue, the job will be lost.
Possible solution: only remove from the liveJobs queue once adding to the deadJobs queue has succeeded with no error.
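The ordering in the proposed solution might look like this in Python. The function retire_job and the use of plain dictionaries for the two queues are assumptions for illustration; Tango's actual queue types may differ.

```python
def retire_job(job_id, live_jobs, dead_jobs):
    """Move a job from live_jobs to dead_jobs without a window where it is in neither."""
    job = live_jobs[job_id]
    # Add first: if this raises, the job is still safely in live_jobs.
    dead_jobs[job_id] = job
    # Remove only after the add has succeeded.
    del live_jobs[job_id]
```

With this ordering a crash between the two steps leaves the job in both queues rather than in neither, so recovery code has to tolerate duplicates.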

Multiplexing VMMS per job

Jobs can pick which VMMS they want to run on. Some can run on Tashi and some on Distributed Docker.

json output is hard to parse

The JSON output of getInfo and getPool uses JSON Arrays of ["a=X", "b=Y"] instead of Objects {"a": X, "b": Y}. This makes them nontrivial to parse, since the client has to dive into the strings rather than letting a JSON library do all the work for it.

Is there a reason for doing things this way?
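Until the format changes, a client can work around it with a small adapter like the Python sketch below. It assumes the payload is a bare JSON array of "key=value" strings; the real response may wrap the array in other fields, and the function names are hypothetical.

```python
import json


def pairs_to_object(pairs):
    """Turn ["a=X", "b=Y"] into {"a": "X", "b": "Y"}."""
    result = {}
    for item in pairs:
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, value = item.partition("=")
        result[key.strip()] = value.strip()
    return result


def parse_info(raw_json):
    """Decode the JSON array, then rebuild the mapping the server flattened."""
    return pairs_to_object(json.loads(raw_json))
```

All values come back as strings, so numeric fields lose their type; returning a JSON Object from the server would avoid that.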

Writing unit tests for Preallocator

I'm not entirely sure about the Preallocator implementation so I wasn't able to do it, but we definitely need this.

I already created a boilerplate at tests/testPreallocator.py

UTC regression

It's baaaaack! When I autograded this program the local time was 6:54, but Tango is reporting UTC instead:

_begin_
Autograder [Mon Aug 24 22:54:03 2015]: Received job [email protected]:82
Autograder [Mon Aug 24 22:54:11 2015]: Success: Autodriver returned normally

Autograder [Mon Aug 24 22:54:11 2015]: Here is the output from the autograder:

Autodriver: Job exited with status 0
...
_end_

However, the times in the job trace are correctly reported using local time:
_begin_
Runtime Trace
2015-08-24 18:54:03 | Added job [email protected]:82 to queue
2015-08-24 18:54:03 | Dispatched job [email protected]:82 [try 0]
2015-08-24 18:54:03 | Assigned job [email protected]:82 existing VM
...
_end_
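The four-hour gap between the two excerpts is what you get when one code path formats a timestamp in UTC and another in local time. The Python sketch below reproduces it; format_stamp is a hypothetical helper, and the fixed UTC-4 offset is an assumption standing in for the server's local zone.

```python
from datetime import datetime, timedelta, timezone


def format_stamp(epoch_secs, tz=None):
    """Format an epoch timestamp in the given zone (None means system local)."""
    moment = datetime.fromtimestamp(epoch_secs, tz=timezone.utc).astimezone(tz)
    return moment.strftime("%a %b %d %H:%M:%S %Y")


# The instant from the feedback excerpt, expressed as epoch seconds.
received = datetime(2015, 8, 24, 22, 54, 3, tzinfo=timezone.utc).timestamp()

utc_text = format_stamp(received, timezone.utc)
local_text = format_stamp(received, timezone(timedelta(hours=-4)))
```

Calling format_stamp with tz left as None uses the system's local zone, which is the behavior requested for the feedback header.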

Tango should report local times in the job trace and the logfiles

In the feedback that Tango returns to the client, it should be using local times rather than UTC. For example, I submitted this job at 11:07am EST:

Autograder [Thu Jan 15 16:07:46 2015]: Received...
Autograder [Thu Jan 15 16:07:50 2015]: Success: Autodriver returned normally
Autograder [Thu Jan 15 16:07:50 2015]: Here is the output from the autograder:

Recent change in MD5 hash function seems to break Autolab integration

Hi there,

I came across an issue where if I deleted an assessment in Autolab and reuploaded it, Tango would grade the latest file submitted to the previously deleted assessment when a new file was submitted to the reuploaded assessment (even if the files differ). Since there was an update in commit 050e5fc which had to do with MD5 checking, I updated Tango to see if this solved my issue. Unfortunately, this led me to this error in Autolab:

screenshot from 2015-09-15 01-46-31

After doing some digging in the source, it seems like the output from Tango's open() function in tangoREST.py is incompatible with what Autolab expects since the commit mentioned above. As you can see from the screenshot, the error finally occurs in tango_upload in Autolab's autograde.rb, but is due to the existing_files variable which is obtained from a TangoClient.open(..) call.

Thanks for looking into this.

resetTango should validate and clean up the machines dictionary

I have encountered a couple of situations in testing where the preallocator's machines data structure got out of sync. There were machine ids in the list that no longer had an entry in the relevant queue, which prevented any jobs from being scheduled on that machine. If all machines get into that state, job scheduling would halt. #77 fixes some of these, but it probably makes sense to either reset the preallocator state or validate it when tango is reset.

Nasty 15-381 bug

There is a corner case (experienced by the 15-381 TA) that causes Tango to improperly use a previously cached version of a submission file.

/pool has unexpected behavior when there are no VM pools

When there are no VM pools (for example, when Tango has just started), the /pool endpoint returns {"pools": {}, "statusMsg": "Pool not found", "statusId": -1}, complaining that the image name is invalid. This causes unexpected behavior on the frontend Autolab job status page.

Tango3 is not verifying job requests

If the user specifies a non-existing VM image, Tango should return immediately with a validation error. It doesn't. Instead, Tango runs waitVM three times in an unsuccessful attempt to start a non-existing VM image.

Tango should be doing extensive validation of job requests, which it apparently isn't.
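An up-front check of the kind requested could be as small as the Python sketch below. validate_image is a hypothetical function, and how Tango would obtain the set of available images is left as an assumption.

```python
def validate_image(requested_image, available_images):
    """Return a list of validation errors; an empty list means the request is fine."""
    errors = []
    if not requested_image:
        errors.append("no VM image specified")
    elif requested_image not in available_images:
        errors.append("unknown VM image: %s" % requested_image)
    return errors
```

Rejecting the request when the list is non-empty would let Tango answer immediately instead of retrying waitVM on an image that can never start.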

validateJob does not check for a Makefile

Any TangoJob must have a Makefile since the autodriver runs make in order to start running a job. validateJob should ensure that a Makefile is part of the input files. If not, it should reject the job.
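A sketch of that check in Python is shown below. validate_makefile is a hypothetical helper, and it assumes the job's input files are available as a list of destination file names (the names the files will have inside the VM).

```python
def validate_makefile(dest_names):
    """Return validation errors unless exactly one input file is named Makefile."""
    count = sum(1 for name in dest_names if name == "Makefile")
    if count == 1:
        return []
    if count == 0:
        return ["job has no Makefile"]
    return ["job has %d Makefiles, expected exactly one" % count]
```

The exactly-one rule mirrors the job constraints stated in the introduction; a looser check that only requires at least one Makefile would also stop the failure described here.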
