Comments (3)
Thanks for the findings. The platform team will be working on the Kubernetes solution next week.
The "PyTorchJob" operator/CR from standard Kubeflow training operator allows us to run multiple processes within single container in a pod (Master pod)
We will be testing it with the Kubeflow training operator and will post an update when the work is done, as part of issue #88.
from fms-hf-tuning.
> There is also the third option where the processes are distributed across multiple Kube pods, but this may be over-complex. This would be the standard Kubeflow training operator approach.
The "PyTorchJob" operator/CR from standard Kubeflow training operator allows us to run multiple processes within single container in a pod (Master pod) like the option 1 when we just want to run a multi-gpu single node training job. When we wish to spawn multi-node multi-gpu job, then we would leverage the worker pod where distributed environment variables (node rank, master address, port etc) are automatically injected by the operator. We just simply replicate accelerate launch in the worker pod and the node rank from the operator determines whether the pod is a worker pod or not. Also there is local rank created by torch.distributed which differentiates between all the processes.
For option 1, AFAIK most of the popular container runtimes are multiprocess-friendly, and on the resource side, requests and limits are set at the container level in Kubernetes.
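To make the container-level point concrete, here is a minimal sketch using the `kubernetes` Python client; the image tag and sizes are placeholders, and `nvidia.com/gpu` assumes the NVIDIA device plugin is installed:

```python
from kubernetes import client

# Requests/limits are declared per container, not per pod; a pod's effective
# request is the sum over its containers. In option 1, the single container
# claims all the GPUs the processes will share.
trainer = client.V1Container(
    name="trainer",
    image="fms-hf-tuning:latest",  # placeholder image tag
    resources=client.V1ResourceRequirements(
        requests={"cpu": "8", "memory": "64Gi", "nvidia.com/gpu": "4"},
        limits={"cpu": "8", "memory": "64Gi", "nvidia.com/gpu": "4"},
    ),
)
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="single-container-multigpu"),
    spec=client.V1PodSpec(containers=[trainer], restart_policy="Never"),
)
```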
from fms-hf-tuning.
If we consider the constrained problem of running GPU jobs within a single pod, where each GPU is handled by a single process, there are the following options for running the multiple processes:
- all together in a single container, within a single Kube pod
- individually, each within its own container, with all containers housed within a single Kube pod (a sketch of this layout follows below).
There is also a third option where the processes are distributed across multiple Kube pods, but this may be overly complex. This would be the standard Kubeflow training operator approach.
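For the second option, a minimal sketch of what that pod could look like via the `kubernetes` Python client, assuming 4 GPUs per node, a localhost rendezvous (containers in a pod share the network namespace), and a hypothetical per-GPU entrypoint `worker.py`:

```python
from kubernetes import client

# One container per GPU, all inside a single pod. Each container claims
# exactly one GPU, and a rank is passed in via an env var.
containers = [
    client.V1Container(
        name=f"worker-{i}",
        image="fms-hf-tuning:latest",        # placeholder image tag
        command=["python", "worker.py"],      # hypothetical per-GPU entrypoint
        env=[client.V1EnvVar(name="LOCAL_RANK", value=str(i))],
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"},   # exactly one GPU per container
        ),
    )
    for i in range(4)  # assumption: 4 GPUs on the node
]
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="one-container-per-gpu"),
    spec=client.V1PodSpec(containers=containers, restart_policy="Never"),
)
```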
Hugging Face's recommendation is to run distributed training jobs using accelerate, which:
- is built on top of `torchrun`; the main process spawns multiple worker processes for distributed training, and `torchrun` has a watchdog agent that handles things like fault tolerance (see the worker-script sketch after this list).
- has its processes communicate via various backends (e.g. `static` or `c10d`).
- has its GPUs communicate over GPU-network interfaces (e.g., NCCL).
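A minimal sketch of the kind of worker script these launchers spawn, assuming one process per GPU and the NCCL backend; `torchrun` (or `accelerate launch`) sets the `LOCAL_RANK` variable read below:

```python
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets LOCAL_RANK (and RANK/WORLD_SIZE) for each worker it spawns.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Rendezvous via the configured backend (`static` or `c10d`); NCCL then
    # handles the GPU-to-GPU collectives.
    dist.init_process_group(backend="nccl")

    # Toy collective to show the processes talking to each other: after the
    # all-reduce, every rank holds the world size.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    if dist.get_rank() == 0:
        print(f"world size seen by rank 0: {int(t.item())}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched, for example, with `torchrun --standalone --nproc_per_node=4 toy_worker.py`; the `--max-restarts` flag is where the watchdog's restart-based fault tolerance comes in.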
Option 1:
- it is Docker-compliant to run multiple processes to parallelize work (e.g., running workers to parallelize an SQL query); this is very much like a master process spawning many child processes for distributed training.
- has some fault tolerance capabilities already built in.
Option 2:
- although the HF blog seems to claim that accelerate is backward compatible with `torchrun`, it is not clear how much of that is true; certainly in the code there are a lot of accelerate-specific flags that will only be set by `accelerate launch` (see the compatibility sketch after this list).
- for Kube jobs it is mostly recommended to use the PyTorch job controller; here is an example for distributed CPU jobs that can be extrapolated to GPUs (edit: this is incorrect, as it will launch each worker in its own pod). They use `torchrun`, but it probably also works for `accelerate launch` (due to the similarities in the API).
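On the backward-compatibility question, a minimal sketch of a script that should run under either launcher, since `Accelerator` falls back to the standard `torch.distributed` environment variables that `torchrun` also sets; the model and data here are toy placeholders:

```python
import torch
from accelerate import Accelerator

# Accelerator reads the standard torch.distributed env vars, so this should
# come up correctly under either `accelerate launch` or `torchrun`.
accelerator = Accelerator()
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = accelerator.prepare(model, optimizer)

# One toy training step: synthetic batch, squared-output loss.
x = torch.randn(8, 16, device=accelerator.device)
loss = model(x).square().mean()
accelerator.backward(loss)
optimizer.step()
accelerator.print(f"processes: {accelerator.num_processes}")
```

Either `accelerate launch --num_processes 4 toy_accelerate.py` or `torchrun --nproc_per_node=4 toy_accelerate.py` should work for this minimal case; the accelerate-specific flags mentioned above matter once features beyond plain data parallelism come into play.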
from fms-hf-tuning.
Related Issues (20)
- Prompt Tuning returns low-quality results HOT 33
- Test SFTtrainer image HOT 1
- feat: standardize the format for metrics, operations, controls in the yamls used by `TrainerControllerCallback`
- bug: AIM package being installed causes the trainer to expect the AIM server to be running. HOT 1
- feat: Expose the trainer state as a trainer controller metric
- feat: Exposed the evaluation metrics for rules within trainer controller
- bug: build output and auto-generated file are not ignored
- feat: support for robust benchmarking of fms-hf-tuning HOT 8
- Contribute ADR for Acceleration Framework Idea
- bug: `eval` is still not safe even with checks for `__` and `"__builtins__": None`
- bug: logging_steps greater than one results in TypeError when evaluating trainer controller rule HOT 2
- Add unit tests for tuning/sft_trainer.py HOT 1
- Add unit tests to tuning/utils/config_utils.py HOT 4
- Add unit tests for tuning/utils/merge_model_utils.py HOT 4
- Document the linting process.
- bug: Boolean values are represented as strings in default fsdp config translates to True HOT 8
- bug: Using more than 1 GPU causes random stalls and exceptions HOT 9
- Update launch training for multi GPU training
- Wrong repo - deleted