
saturn's Introduction


 Saturn: Optimized Training of Multiple Large Deep Learning Models


Saturn is a novel system for training multiple deep learning models at once. It automatically optimizes submitted jobs by selecting parallelization techniques, determining resource allocations, and constructing execution schedules. Applying Saturn to hyperparameter optimization or model selection requires only a few lines of code.
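To make those three decisions concrete, here is a toy sketch of the joint problem. It is not Saturn's solver or API: the job names, techniques, runtimes, and the greedy GPU-packing rule are all hypothetical. Each job offers a few (technique, GPU count) options with an assumed runtime, and the sketch tries every option mix and job order on a 4-GPU pool to find the plan with the shortest makespan.

```python
from itertools import permutations, product

# Hypothetical toy workload: per job, (technique, num_gpus) -> runtime in hours.
JOBS = {
    "job-a": {("ddp", 1): 8.0, ("ddp", 2): 4.5, ("fsdp", 4): 2.5},
    "job-b": {("ddp", 1): 6.0, ("fsdp", 2): 3.5},
    "job-c": {("pipeline", 2): 5.0, ("pipeline", 4): 3.0},
}
NUM_GPUS = 4


def makespan(choices, num_gpus):
    """Greedy list scheduling: each job, in order, takes the GPUs that free up earliest."""
    free_at = [0.0] * num_gpus
    for gpus, runtime in choices:
        free_at.sort()
        start = free_at[gpus - 1]  # job starts once its gpus-th earliest GPU is free
        for i in range(gpus):
            free_at[i] = start + runtime
    return max(free_at)


def solve(jobs, num_gpus):
    """Try every job order and every option mix; keep the shortest makespan."""
    best_span, best_plan = float("inf"), None
    for order in permutations(jobs):
        for picks in product(*(jobs[name].items() for name in order)):
            span = makespan([(gpus, rt) for (_, gpus), rt in picks], num_gpus)
            if span < best_span:
                best_plan = [(name, tech, gpus) for name, ((tech, gpus), _) in zip(order, picks)]
                best_span = span
    return best_span, best_plan


span, plan = solve(JOBS, NUM_GPUS)
print(span, plan)
```

On this toy input the best plan finishes in 7.5 hours. Brute force only works for a handful of jobs, since the number of plans grows with n! times the product of each job's option count; a real system needs a dedicated solver.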


Saturn is designed for extensibility: users can specify new execution procedures, which are then included in its optimization plan and search space. This way, you can adopt the latest advances in model execution optimization without waiting for library updates and changes.
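One way to picture this kind of extensibility is a registry that the search reads from, so a new procedure only needs a runtime estimate to become a candidate. The sketch below is hypothetical and is not Saturn's actual interface: `register_technique`, `candidate_options`, the technique names, and both cost models are invented for illustration.

```python
import math
from typing import Callable, Dict

# Hypothetical plug-in interface (not Saturn's actual API): a technique is a
# function estimating runtime in hours for a job of `work` single-GPU hours.
RuntimeEstimator = Callable[[float, int], float]
TECHNIQUES: Dict[str, RuntimeEstimator] = {}


def register_technique(name: str) -> Callable[[RuntimeEstimator], RuntimeEstimator]:
    """Decorator that adds an execution procedure to the search space."""
    def wrap(fn: RuntimeEstimator) -> RuntimeEstimator:
        if name in TECHNIQUES:
            raise ValueError(f"technique {name!r} is already registered")
        TECHNIQUES[name] = fn
        return fn
    return wrap


@register_technique("ddp")
def ddp_runtime(work: float, gpus: int) -> float:
    # Assumed cost model: linear speedup plus 10% communication overhead.
    return work / gpus * (1.1 if gpus > 1 else 1.0)


@register_technique("my-offload")
def offload_runtime(work: float, gpus: int) -> float:
    # A user-defined procedure: single-GPU only, with an assumed 1.5x slowdown.
    return work * 1.5 if gpus == 1 else math.inf


def candidate_options(work: float, gpu_counts=(1, 2, 4)) -> Dict[tuple, float]:
    """Every feasible (technique, gpus) option, with its estimated runtime."""
    options = {}
    for name, estimate in TECHNIQUES.items():
        for gpus in gpu_counts:
            runtime = estimate(work, gpus)
            if math.isfinite(runtime):
                options[(name, gpus)] = runtime
    return options


print(candidate_options(8.0))
```

Because the search only consults the registry, adding a procedure never requires editing the solver itself.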


Install Saturn

To install Saturn, please read the instructions. We're always excited to hear about new use cases and details of your experience with Saturn, so feel free to contact us at [email protected] if you want to share news.

Framework Support

We currently prioritize PyTorch support, but Saturn's general techniques are framework-independent. We would welcome contributions for TensorFlow and JAX.

Contributing

We welcome contributions to Saturn. Areas of particular interest are an alternative solver (e.g. using reinforcement learning), new interfaces, dashboards, and ways to support online job submissions. Please let us know if you encounter any bugs or have any suggestions by submitting an issue.

You can join the Slack here: https://join.slack.com/t/saturn-dl/shared_invite/zt-267mfi3s4-ifUYLiJUtaVeGFcYe9vbxA

Documentation

You can find the docs for Saturn here.

How to Cite this Work

If you use this system in an academic work, please cite our tech report as follows.

@article{nagrechasaturn,
  title={Saturn: An Optimized Data System for Multi-Large-Model Deep Learning Workloads (Information System Architectures)},
  author={Nagrecha, Kabir and Kumar, Arun}
}

The Team

Saturn is currently developed and maintained by Kabir Nagrecha at UCSD.

License

Saturn uses Apache License 2.0.

saturn's People

Contributors

knagrecha, knagrecha-nflx, ollie-robin


saturn's Issues

Reinforcement Learning Solver

Summary

Currently, the only supported way to solve the SPASE problem is with an MILP solver such as Gurobi. Gurobi licenses are free for academic users but can be expensive for everyone else, and open-source MILP solvers do not perform as well. I propose that we build a reinforcement-learning-based solver as an open-source alternative for users without access to Gurobi.

Key Proposal
A new RL solver for Saturn's SPASE problem.

What it would take
This would be a significant overhaul. RL is not as reliable as an MILP solver. We'd need a lot of testing to be sure this works.

@knagrecha @knagrecha-nflx
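The testing this issue calls for could begin with a harness that measures how far a candidate solver lands from the exhaustive optimum on small random instances. The sketch below is hypothetical throughout: a greedy rule stands in for the proposed RL policy, the runtime model (15% overhead per additional GPU) is invented, and the simple GPU-packing scheduler is not Saturn's.

```python
import random
from itertools import permutations, product


def makespan(plan, num_gpus):
    """plan: ordered (gpus, runtime) pairs; each job takes the earliest-free GPUs."""
    free_at = [0.0] * num_gpus
    for gpus, runtime in plan:
        free_at.sort()
        start = free_at[gpus - 1]
        for i in range(gpus):
            free_at[i] = start + runtime
    return max(free_at)


def exhaustive(jobs, num_gpus):
    """Optimum under this scheduler: try every job order and option mix."""
    return min(
        makespan(list(opts), num_gpus)
        for order in permutations(jobs)
        for opts in product(*order)
    )


def greedy(jobs, num_gpus):
    """Stand-in for a candidate solver such as an RL policy (hypothetical)."""
    picks = [min(opts, key=lambda o: o[0] * o[1]) for opts in jobs]  # fewest GPU-hours
    picks.sort(key=lambda o: o[1], reverse=True)  # longest job first
    return makespan(picks, num_gpus)


def random_instance(rng, num_jobs, gpu_counts=(1, 2, 4)):
    jobs = []
    for _ in range(num_jobs):
        work = rng.uniform(2.0, 10.0)
        # Assumed scaling model: 15% overhead per additional GPU.
        jobs.append([(g, work / g * (1 + 0.15 * (g - 1))) for g in gpu_counts])
    return jobs


def optimality_gaps(solver, trials=50, num_jobs=4, num_gpus=4, seed=0):
    """Ratio of solver makespan to optimal makespan on each random instance."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        jobs = random_instance(rng, num_jobs)
        gaps.append(solver(jobs, num_gpus) / exhaustive(jobs, num_gpus))
    return gaps


gaps = optimality_gaps(greedy)
print(f"mean gap {sum(gaps) / len(gaps):.3f}, worst {max(gaps):.3f}")
```

A ratio of 1.0 means the candidate matched the optimum; tracking the worst case, not just the mean, matters here because the issue's concern is reliability.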

Publish to PyPI

Summary: Let's get Saturn on PyPI!

What it would take: Getting all of the packaging metadata ready to go. It's not a process I'm very familiar with, so it might require some iteration.

Outcomes: Hopefully this makes it easier for our users to get set up!

Create examples with LLaMA [recommended for new contributors!]

Would be great to have some examples using a very new LLM like LLaMA.

Expected dev time: <3 days

Requirements: A new examples folder, like the existing WikiText103 one, but using a LLaMA model (possibly from Hugging Face). The training script and HPO spec should largely stay the same; we just need to port over the data-loading code and the model-architecture loading code.
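For orientation, an HPO spec typically expands into one independent training job per configuration, which is the multi-model workload Saturn schedules. This sketch is hypothetical: the spec layout, the `expand_grid` helper, and the `llama-placeholder` model name are invented, and the existing WikiText103 example's spec may look different.

```python
from itertools import product

# Hypothetical HPO spec (not Saturn's actual format): each key maps to candidate values.
HPO_SPEC = {
    "model": ["llama-placeholder"],
    "lr": [1e-5, 3e-5, 1e-4],
    "batch_size": [8, 16],
    "epochs": [1],
}


def expand_grid(spec):
    """Turn a grid spec into one training-job config per combination of values."""
    keys = sorted(spec)
    return [dict(zip(keys, values)) for values in product(*(spec[k] for k in keys))]


jobs = expand_grid(HPO_SPEC)
print(len(jobs))  # 6
```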
