NeMo Framework Launcher

The NeMo Framework Launcher is a cloud-native tool for launching end-to-end NeMo Framework training jobs.

See the NeMo Framework User Guide for the most up-to-date information and how to get started quickly.

The NeMo Framework focuses on foundation model training for generative AI. Large language model (LLM) pretraining typically requires substantial compute and model parallelism to scale efficiently. NeMo Framework supports current large-scale training techniques, including:

  • Model parallelism
    • Tensor
    • Pipeline
    • Sequence
  • Distributed Optimizer
  • Mixed precision training
    • FP8
    • BF16
  • Distributed Checkpointing
  • Community Models
    • LLAMA-2

NeMo Framework model training scales to thousands of GPUs and can be used to train LLMs on trillions of tokens.
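As a rough illustration of how the parallelism techniques listed above relate to cluster size, the sketch below assumes the common convention that the total GPU count equals tensor-parallel size × pipeline-parallel size × data-parallel size. The function name and the example numbers are hypothetical, and the sketch ignores any other parallel dimensions a real job might use.

```python
def data_parallel_size(num_gpus: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Return the number of model replicas when each replica spans TP * PP GPUs.

    Assumes total GPUs = tensor_parallel * pipeline_parallel * data_parallel.
    """
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if num_gpus % gpus_per_replica != 0:
        raise ValueError(
            f"{num_gpus} GPUs cannot be split evenly into replicas of {gpus_per_replica} GPUs"
        )
    return num_gpus // gpus_per_replica


# Hypothetical job: 1024 GPUs, tensor parallel 8, pipeline parallel 4.
replicas = data_parallel_size(1024, tensor_parallel=8, pipeline_parallel=4)
print(replicas)  # 32
```

In this hypothetical setup each model replica occupies 32 GPUs, so 1024 GPUs hold 32 replicas that each process a different slice of every global batch.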

The Launcher is designed to be a simple, easy-to-use tool for launching NeMo FW training jobs on cloud service providers (CSPs) or on-prem clusters. It is typically run from a head node and requires only a minimal Python installation.

The Launcher generates and launches submission scripts for the cluster scheduler, and it organizes and stores job results. Tested configuration files are included with the Launcher, and any setting in a configuration file can be modified by the user.
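To make the idea of a generated submission script concrete, here is a minimal sketch that renders a batch script as text. It assumes a Slurm scheduler, which the text above does not name, and it is not the Launcher's actual generator: the function `render_submission_script`, its parameters, and the `results/` layout are hypothetical.

```python
def render_submission_script(job_name: str, nodes: int, gpus_per_node: int, command: str) -> str:
    """Render a Slurm-style batch script as a string (illustrative only)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gpus-per-node={gpus_per_node}",
        # Hypothetical results layout; the directory must exist before submission.
        f"#SBATCH --output=results/{job_name}/log-%j.out",
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job name and training command.
script = render_submission_script("gpt_5b", nodes=2, gpus_per_node=8, command="python train.py")
print(script)
```

A script like this would then be handed to the scheduler's submit command; generating it from configuration values is what lets one set of .yaml files drive many different jobs.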

The NeMo FW Launcher is tested with the NeMo FW Container, which can be applied for here. Access is granted automatically. Users may also configure the Launcher to use any container image they provide.

The NeMo FW Launcher supports:

  • Cluster setup and configuration
  • Data downloading, curating, and processing
  • Model parallel configuration
  • Model training
  • Model fine-tuning (SFT and PEFT)
  • Model evaluation
  • Model export and deployment

Supported models include:

  • GPT
    • Pretraining, Fine-tuning, SFT, PEFT
  • BERT
  • T5/MT5
    • PEFT, MoE (non-expert)

See the Feature Matrix for more details.

Installation

The NeMo Framework Launcher should be installed in a Python virtual environment on a head node or a local machine.

git clone https://github.com/NVIDIA/NeMo-Megatron-Launcher.git
cd NeMo-Megatron-Launcher
pip install -r requirements.txt

Usage

The best way to get started with the NeMo Framework Launcher is to go through the NeMo Framework Playbooks.

After everything is configured in the .yaml files, the Launcher can be run with:

python main.py

Because the Launcher uses Hydra, any configuration value can be changed either by editing the .yaml file directly or by passing an override on the command line. See Hydra's override grammar for more information.
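The sketch below illustrates the general idea behind dotted `key=value` command-line overrides: each override walks a nested configuration and replaces one value. It is a deliberately simplified stand-in, not Hydra's real parser (which supports a much richer grammar), and the configuration keys shown are hypothetical rather than taken from the Launcher's .yaml files.

```python
from typing import Any


def apply_overrides(config: dict[str, Any], overrides: list[str]) -> dict[str, Any]:
    """Apply 'dotted.key=value' strings to a nested dict, modifying it in place.

    Simplified illustration only: values are kept as strings unless they are
    plain non-negative integers.
    """
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Create intermediate sections that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = int(raw_value) if raw_value.isdigit() else raw_value
    return config


# Hypothetical config, standing in for values loaded from a .yaml file.
config = {"training": {"trainer": {"num_nodes": 1, "devices": 8}}}
apply_overrides(config, ["training.trainer.num_nodes=4", "cluster_type=bcm"])
print(config)
```

With the real Launcher, the equivalent overrides would be appended after `python main.py` on the command line, and values not mentioned there keep whatever the .yaml files specify.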

Contributing

Contributions are welcome!

To contribute to the NeMo Framework Launcher, create a pull request with your changes on GitHub. The pull request will be merged once a NeMo FW developer has reviewed and approved it and it passes the unit and CI tests.

License

The NeMo Framework Launcher is licensed under the Apache 2.0 License.

