intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System
License: Other
Explain why we need dynamic data sharding and how it works.
_initial_nodes is a little confusing. It serves as a queue that stores Pod templates; once a Pod is created, its template is removed from _initial_nodes.
Dynamic data sharding can dispatch shards to workers during training and recover the task if a worker fails.
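As a rough illustration of the idea (the names `Shard` and `ShardManager` are hypothetical, not the actual DLRover API), dynamic sharding can be sketched as a queue of shards plus a table of in-flight shards that are re-queued when a worker fails:

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Shard:
    name: str
    start: int  # index of the first record in the shard
    end: int    # one past the index of the last record


class ShardManager:
    """Dispatches shards to workers and re-queues shards of failed workers."""

    def __init__(self, dataset_size, shard_size):
        # All shards start in the to-do queue.
        self.todo = deque(
            Shard("data", i, min(i + shard_size, dataset_size))
            for i in range(0, dataset_size, shard_size)
        )
        self.doing = {}  # worker_id -> shard currently being processed

    def get_shard(self, worker_id):
        # Hand the next pending shard to the requesting worker.
        if not self.todo:
            return None
        shard = self.todo.popleft()
        self.doing[worker_id] = shard
        return shard

    def report_done(self, worker_id):
        # The worker finished its shard; drop it from the in-flight table.
        self.doing.pop(worker_id, None)

    def recover_worker(self, worker_id):
        # The worker failed: put its in-flight shard back so another
        # worker can pick it up, instead of losing that slice of data.
        shard = self.doing.pop(worker_id, None)
        if shard is not None:
            self.todo.appendleft(shard)
```

Because workers only pull shard metadata (start/end offsets) rather than owning a fixed partition, the number of workers can change during training without repartitioning the dataset.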
DeepRec: implement delta model exportation.

Add a hook to check prerequisites when executing a scale plan, because the prerequisites differ in different scenarios.
The PS/worker Pods may be preempted during training. The ElasticJob should relaunch the deleted Pods to support fault tolerance.
|-brain            # Automatically generates the resource plan of the job.
|-operator
  |-controllers
    |-elastic-job    # Creates a k8s Job.
    |-resource-scale # Scales the job resource out or in according to the Custom Resource (CR).
|-elasticdl        # Dispatches data shards to workers and monitors training nodes.
|-easydl           # APIs for the training loop of TensorFlow/PyTorch to use elastic training.
We use the tf.estimator framework in AntGroup. However, Keras is more common than tf.estimator, and TF 2.x supports training a Keras model with ParameterServerStrategy. Alternatively, we can implement a trainer based on tf.estimator and convert a Keras model to an estimator model in TensorFlow.

Brain requires a job to specify a couple of parameters for processing the requests from the job, e.g., processor, data store, and config retriever. Currently those parameters are constants in the master source code, which makes them inconvenient to update; moreover, all jobs share the same configuration. It would therefore be better to make those parameters configurable in the job's YAML.
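For instance (the field names below are illustrative, not the actual ElasticJob schema), the job YAML could carry a per-job Brain section in its spec:

```yaml
apiVersion: elastic.iml.github.io/v1alpha1
kind: ElasticJob
metadata:
  name: elasticjob-sample
spec:
  # Illustrative fields only: per-job Brain configuration instead of
  # constants hard-coded in the master source code.
  brainConfig:
    processor: default
    dataStore: mysql
    configRetriever: env
```

With such a section, each job can pick its own processor, data store, and config retriever, while jobs that omit it fall back to the current defaults.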
INFO[0291] jobName: elasticjob-sample, phase Running
INFO[0291] Master elasticjob-elasticjob-sample-master is deleted and relaunch a new one elasticjob-elasticjob-sample-master
INFO[0291] Pod elasticjob-sample-edljob-ps-0 is deleted and will be relaunched
1.670830146192972e+09 INFO Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference {"controller": "elasticjob", "controllerGroup": "elastic.iml.github.io", "controllerKind": "ElasticJob", "elasticJob": {"name":"elasticjob-sample","namespace":"default"}, "namespace": "default", "name": "elasticjob-sample", "reconcileID": "c5e12fb0-74bb-404a-a2a6-d6ea166b7673"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x8 pc=0x103d4475c]
goroutine 269 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
/Users/lazylong/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118 +0x1a0
panic({0x1040bca60, 0x104d06ea0})
/opt/homebrew/Cellar/go/1.18/libexec/src/runtime/panic.go:838 +0x204
github.com/intelligent-machine-learning/easydl/dlrover/go/operator/pkg/controllers/psstrategy.(*PSTaskManager).getTotalTaskCount(...)
/Users/lazylong/workspace/easydl/dlrover/go/operator/pkg/controllers/psstrategy/strategy.go:154
github.com/intelligent-machine-learning/easydl/dlrover/go/operator/pkg/controllers/psstrategy.(*PSTaskManager).HandleFaultPods(0x1400045cd50, {0x10429d8f8, 0x14000131220}, 0x140007f04e0)
/Users/lazylong/workspace/easydl/dlrover/go/operator/pkg/controllers/psstrategy/strategy.go:267 +0x28c
github.com/intelligent-machine-learning/easydl/dlrover/go/operator/pkg/controllers.(*ElasticJobReconciler).handleFaultPods(...)
/Users/lazylong/workspace/easydl/dlrover/go/operator/pkg/controllers/elasticjob_controller.go:223
github.com/intelligent-machine-learning/easydl/dlrover/go/operator/pkg/controllers.(*ElasticJobReconciler).reconcileJobs(0x140006ff4c0, 0x140007f04e0)
/Users/lazylong/workspace/easydl/dlrover/go/operator/pkg/controllers/elasticjob_controller.go:134 +0x614
github.com/intelligent-machine-learning/easydl/dlrover/go/operator/pkg/controllers.(*ElasticJobReconciler).Reconcile(0x140006ff4c0, {0x1042997d0?, 0x14000975d10?}, {{{0x140006e0c60, 0x7}, {0x1400016aba0, 0x11}}})
/Users/lazylong/workspace/easydl/dlrover/go/operator/pkg/controllers/elasticjob_controller.go:96 +0x1e0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x104299728?, {0x1042997d0?, 0x14000975d10?}, {{{0x140006e0c60?, 0x1041c3c00?}, {0x1400016aba0?, 0xc0376bcfabc18d3c?}}})
/Users/lazylong/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121 +0x8c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0x14000228f00, {0x104299728, 0x140006ff400}, {0x10410e360?, 0x140001ccda0?})
/Users/lazylong/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320 +0x2a8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0x14000228f00, {0x104299728, 0x140006ff400})
/Users/lazylong/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273 +0x1b0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
/Users/lazylong/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234 +0x78
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
/Users/lazylong/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:230 +0x294
exit status 2
We need to implement a monitor to watch the training speed of workers.
We should write an introduction to EasyDL in the README.
Provide definitions and suggested usage for NodeGroupResource, launch_nodes, and removed_nodes.
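A minimal sketch of what these could look like (the field names and the `ScalePlanSketch` wrapper are assumptions for illustration, not the actual DLRover definitions): a NodeGroupResource describes the resources of one group of nodes (e.g. all PS or all workers), while a scale plan lists nodes to launch and nodes to remove.

```python
from dataclasses import dataclass, field


@dataclass
class NodeResource:
    cpu: float   # CPU cores requested per node
    memory: int  # memory in MiB per node


@dataclass
class NodeGroupResource:
    """Resources of a group of nodes: how many nodes the group has
    and the resource of each node (illustrative fields)."""
    count: int
    node_resource: NodeResource


@dataclass
class ScalePlanSketch:
    """Hypothetical scale plan: nodes to create and nodes to delete."""
    launch_nodes: list = field(default_factory=list)
    removed_nodes: list = field(default_factory=list)
```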
DLRover can set the service address of the 1st worker as rdzv_endpoint to execute torchrun.
torchrun \
    --nnodes=MIN_SIZE:MAX_SIZE \
    --nproc_per_node=TRAINERS_PER_NODE \
    --max_restarts=NUM_ALLOWED_FAILURES_OR_MEMBERSHIP_CHANGES \
    --rdzv_id=JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=HOST_NODE_ADDR \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
Provide an end-to-end system test that covers the Go module, dlrover.python, and the dlrover.trainer module.
[2023-03-16 15:39:23,737] [INFO][tensorflow_failover.py:127:refresh_env] successfully refresh TF_CONFIFG {"cluster": {"worker": ["deepctr-auto-scale-edljob-worker-1:2222"], "ps": ["deepctr-auto-scale-edljob-ps-0.dlrover.svc:2222", "deepctr-auto-scale-edljob-ps-1.dlrover.svc:2222"], "chief": [""]}, "task": {"type": "worker", "index": 1}}
[2023-03-16 15:39:23,737] [INFO][tensorflow_failover.py:142:refresh_env] global dict is {'executor': <dlrover.trainer.tensorflow.executor.estimator_executor.EstimatorExecutor object at 0x7f7a56f5f790>, 'failover': <dlrover.trainer.tensorflow.failover.tensorflow_failover.TensorflowFailover object at 0x7f7a56f5f7c0>, 'relaunch_for_ps': True}
[2023-03-16 15:39:23,748] [INFO][file_reader.py:88:iterator] shard is name: "iris_training_data"
start: 128
end: 160
[2023-03-16 15:39:23,753] [INFO][elastic_data_shard_report_hook.py:26:after_run] report_batch_done
[2023-03-16 15:39:23,753] [INFO][estimator_util.py:33:after_run] The training thread should stop for due to ps migration/scaling
[2023-03-16 15:39:23,753] [INFO] [master_client.py:319:join_sync] 1:worker join sync relauch_for_ps
[2023-03-16 15:39:23,754] [INFO][estimator_util.py:41:after_run] Before stopping training thread, worker should wait for cheif to save checkpoint
[2023-03-16 15:39:34,768] [INFO][estimator_util.py:49:after_run] Training thread stopped because chief had saved checkpoint
[2023-03-16 15:39:34,768] [INFO][global_step_hook.py:42:end] hook end
[2023-03-16 15:39:34,885] [INFO][estimator.py:371:train] Loss for final step: 0.875.
[2023-03-16 15:39:34,886] [INFO][tf_kubernetes_worker.py:77:run] ps is migrating or scaling
[2023-03-16 15:39:34,886] [INFO][tf_kubernetes_worker.py:42:init_executor] init_executor
[2023-03-16 15:39:34,886] [INFO][tensorflow_failover.py:41:__init__] initiating tensorflow_failover and failover level is 1
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/dlrover/trainer/entry/local_entry.py", line 27, in <module>
starter.run()
File "/usr/local/lib/python3.8/dist-packages/dlrover/trainer/platform/starter.py", line 94, in run
return execute(args)
File "/usr/local/lib/python3.8/dist-packages/dlrover/trainer/platform/starter.py", line 85, in execute
return worker.run()
File "/usr/local/lib/python3.8/dist-packages/dlrover/trainer/worker/tf_kubernetes_worker.py", line 62, in run
self.start_failover_monitor()
File "/usr/local/lib/python3.8/dist-packages/dlrover/trainer/worker/tf_kubernetes_worker.py", line 48, in start_failover_monitor
self.tensorflow_failover = TensorflowFailover()
File "/usr/local/lib/python3.8/dist-packages/dlrover/trainer/tensorflow/failover/tensorflow_failover.py", line 48, in __init__
self.init_for_dynet()
File "/usr/local/lib/python3.8/dist-packages/dlrover/trainer/tensorflow/failover/tensorflow_failover.py", line 56, in init_for_dynet
self._address = TF_CONFIG["cluster"][task_type][task_id]
IndexError: list index out of range
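The IndexError above occurs because the refreshed cluster spec no longer contains an entry for this task's index (here, worker index 1 after scaling). A defensive lookup (a sketch of the idea, not the actual DLRover fix) validates the index before use so the worker can wait for a new TF_CONFIG instead of crashing:

```python
import json


def get_task_address(tf_config_str, task_type, task_id):
    """Return the address of (task_type, task_id) from a TF_CONFIG JSON
    string, or None if the cluster spec no longer contains that task."""
    tf_config = json.loads(tf_config_str)
    tasks = tf_config.get("cluster", {}).get(task_type, [])
    if task_id >= len(tasks):
        # The task was removed during migration/scaling; the caller
        # should wait for a refreshed TF_CONFIG instead of indexing.
        return None
    return tasks[task_id]


# A cluster spec that lists only worker 0, queried for worker 1:
config = json.dumps({"cluster": {"worker": ["worker-0:2222"]},
                     "task": {"type": "worker", "index": 1}})
print(get_task_address(config, "worker", 1))  # None instead of IndexError
```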
Generate the informer, client, and lister for the CRDs ElasticJob and Scaler.