intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System
License: Other
Explain why we need dynamic data sharding and how it works.
_initial_nodes is a little confusing. It serves as a queue that stores Pod templates; once a Pod is created, its template is removed from _initial_nodes.
Dynamic data sharding can dispatch shards to workers during training and recover the task if a worker fails.
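As a rough illustration of the idea (the names `Shard` and `ShardManager` are hypothetical, not the actual DLRover API), dynamic sharding can be sketched as a queue of shards plus a table of in-flight shards that are re-queued when a worker fails:

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Shard:
    name: str
    start: int  # index of the first record in the shard
    end: int    # one past the index of the last record


class ShardManager:
    """Dispatches shards to workers and re-queues shards of failed workers."""

    def __init__(self, dataset_size, shard_size):
        # All shards start in the to-do queue.
        self.todo = deque(
            Shard("data", i, min(i + shard_size, dataset_size))
            for i in range(0, dataset_size, shard_size)
        )
        self.doing = {}  # worker_id -> shard currently being processed

    def get_shard(self, worker_id):
        # Hand the next pending shard to the requesting worker.
        if not self.todo:
            return None
        shard = self.todo.popleft()
        self.doing[worker_id] = shard
        return shard

    def report_done(self, worker_id):
        # The worker finished its shard; drop it from the in-flight table.
        self.doing.pop(worker_id, None)

    def recover_worker(self, worker_id):
        # The worker failed: put its in-flight shard back so another
        # worker can pick it up, instead of losing that slice of data.
        shard = self.doing.pop(worker_id, None)
        if shard is not None:
            self.todo.appendleft(shard)
```

Because workers only pull shard metadata (start/end offsets) rather than owning a fixed partition, the number of workers can change during training without repartitioning the dataset.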
DeepRec: implement delta model exportation.

Add a hook to check prerequisites when executing a scale plan, because the prerequisites differ in different scenarios.
The PS/worker Pods may be preempted during training. The ElasticJob should relaunch the deleted Pods to support fault tolerance.
|-brain            # Automatically generates the resource plan of the job.
|-operator
  |-controllers
    |-elastic-job    # Creates a k8s Job.
    |-resource-scale # Scales the job resource out or in according to the Custom Resource (CR).
|-elasticdl        # Dispatches data shards to workers and monitors training nodes.
|-easydl           # APIs for the training loop of TensorFlow/PyTorch to use elastic training.
We use the tf.estimator framework in AntGroup. However, Keras is more common than tf.estimator, and TF 2.x supports training a Keras model with ParameterServerStrategy. Alternatively, we can implement a trainer based on tf.estimator and convert a Keras model to an estimator model in TensorFlow.

Brain requires a job to specify a couple of parameters for processing the requests from the job, e.g., processor, data store, and config retriever. Currently those parameters are constants in the master source code, which makes them inconvenient to update; moreover, all jobs share the same configuration. It would therefore be better to make those parameters configurable in the job's YAML.
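For instance (the field names below are illustrative, not the actual ElasticJob schema), the job YAML could carry a per-job Brain section in its spec:

```yaml
apiVersion: elastic.iml.github.io/v1alpha1
kind: ElasticJob
metadata:
  name: elasticjob-sample
spec:
  # Illustrative fields only: per-job Brain configuration instead of
  # constants hard-coded in the master source code.
  brainConfig:
    processor: default
    dataStore: mysql
    configRetriever: env
```

With such a section, each job can pick its own processor, data store, and config retriever, while jobs that omit it fall back to the current defaults.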
INFO[0291] jobName: elasticjob-sample, phase Running
INFO[0291] Master elasticjob-elasticjob-sample-master is deleted and relaunch a new one elasticjob-elasticjob-sample-master
INFO[0291] Pod elasticjob-sample-edljob-ps-0 is deleted and will be relaunched
1.670830146192972e+09 INFO Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference {"controller": "elasticjob", "controllerGroup": "elastic.iml.github.io", "controllerKind": "ElasticJob", "elasticJob": {"name":"elasticjob-sample","namespace":"default"}, "namespace": "default", "name": "elasticjob-sample", "reconcileID": "c5e12fb0-74bb-404a-a2a6-d6ea166b7673"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x8 pc=0x103d4475c]
goroutine 269 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
/Users/lazylong/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118 +0x1a0
panic({0x1040bca60, 0x104d06ea0})
/opt/homebrew/Cellar/go/1.18/libexec/src/runtime/panic.go:838 +0x204
github.com/intelligent-machine-learning/easydl/dlrover/go/operator/pkg/controllers/psstrategy.(*PSTaskManager).getTotalTaskCount(...)
/Users/lazylong/workspace/easydl/dlrover/go/operator/pkg/controllers/psstrategy/strategy.go:154
github.com/intelligent-machine-learning/easydl/dlrover/go/operator/pkg/controllers/psstrategy.(*PSTaskManager).HandleFaultPods(0x1400045cd50, {0x10429d8f8, 0x14000131220}, 0x140007f04e0)
/Users/lazylong/workspace/easydl/dlrover/go/operator/pkg/controllers/psstrategy/strategy.go:267 +0x28c
github.com/intelligent-machine-learning/easydl/dlrover/go/operator/pkg/controllers.(*ElasticJobReconciler).handleFaultPods(...)
/Users/lazylong/workspace/easydl/dlrover/go/operator/pkg/controllers/elasticjob_controller.go:223
github.com/intelligent-machine-learning/easydl/dlrover/go/operator/pkg/controllers.(*ElasticJobReconciler).reconcileJobs(0x140006ff4c0, 0x140007f04e0)
/Users/lazylong/workspace/easydl/dlrover/go/operator/pkg/controllers/elasticjob_controller.go:134 +0x614
github.com/intelligent-machine-learning/easydl/dlrover/go/operator/pkg/controllers.(*ElasticJobReconciler).Reconcile(0x140006ff4c0, {0x1042997d0?, 0x14000975d10?}, {{{0x140006e0c60, 0x7}, {0x1400016aba0, 0x11}}})
/Users/lazylong/workspace/easydl/dlrover/go/operator/pkg/controllers/elasticjob_controller.go:96 +0x1e0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x104299728?, {0x1042997d0?, 0x14000975d10?}, {{{0x140006e0c60?, 0x1041c3c00?}, {0x1400016aba0?, 0xc0376bcfabc18d3c?}}})
/Users/lazylong/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121 +0x8c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0x14000228f00, {0x104299728, 0x140006ff400}, {0x10410e360?, 0x140001ccda0?})
/Users/lazylong/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320 +0x2a8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0x14000228f00, {0x104299728, 0x140006ff400})
/Users/lazylong/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273 +0x1b0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
/Users/lazylong/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234 +0x78
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
/Users/lazylong/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:230 +0x294
exit status 2
We need to implement a monitor to watch the training speed of workers.
We should write an introduction to EasyDL in the README.
Provide definitions and suggested usage for NodeGroupResource, launch_nodes, and removed_nodes.
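A minimal sketch of what these could look like (the field names and the `ScalePlanSketch` wrapper are assumptions for illustration, not the actual DLRover definitions): a NodeGroupResource describes the resources of one group of nodes (e.g. all PS or all workers), while a scale plan lists nodes to launch and nodes to remove.

```python
from dataclasses import dataclass, field


@dataclass
class NodeResource:
    cpu: float   # CPU cores requested per node
    memory: int  # memory in MiB per node


@dataclass
class NodeGroupResource:
    """Resources of a group of nodes: how many nodes the group has
    and the resource of each node (illustrative fields)."""
    count: int
    node_resource: NodeResource


@dataclass
class ScalePlanSketch:
    """Hypothetical scale plan: nodes to create and nodes to delete."""
    launch_nodes: list = field(default_factory=list)
    removed_nodes: list = field(default_factory=list)
```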
DLRover can set the service address of the 1st worker as rdzv_endpoint to execute torchrun.
torchrun \
    --nnodes=MIN_SIZE:MAX_SIZE \
    --nproc_per_node=TRAINERS_PER_NODE \
    --max_restarts=NUM_ALLOWED_FAILURES_OR_MEMBERSHIP_CHANGES \
    --rdzv_id=JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=HOST_NODE_ADDR \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
Provide an end-to-end system test that covers the Go module, dlrover.python, and the dlrover.trainer module.
[2023-03-16 15:39:23,737] [INFO][tensorflow_failover.py:127:refresh_env] successfully refresh TF_CONFIFG {"cluster": {"worker": ["deepctr-auto-scale-edljob-worker-1:2222"], "ps": ["deepctr-auto-scale-edljob-ps-0.dlrover.svc:2222", "deepctr-auto-scale-edljob-ps-1.dlrover.svc:2222"], "chief": [""]}, "task": {"type": "worker", "index": 1}}
[2023-03-16 15:39:23,737] [INFO][tensorflow_failover.py:142:refresh_env] global dict is {'executor': <dlrover.trainer.tensorflow.executor.estimator_executor.EstimatorExecutor object at 0x7f7a56f5f790>, 'failover': <dlrover.trainer.tensorflow.failover.tensorflow_failover.TensorflowFailover object at 0x7f7a56f5f7c0>, 'relaunch_for_ps': True}
[2023-03-16 15:39:23,748] [INFO][file_reader.py:88:iterator] shard is name: "iris_training_data"
start: 128
end: 160
[2023-03-16 15:39:23,753] [INFO][elastic_data_shard_report_hook.py:26:after_run] report_batch_done
[2023-03-16 15:39:23,753] [INFO][estimator_util.py:33:after_run] The training thread should stop for due to ps migration/scaling
[2023-03-16 15:39:23,753] [INFO] [master_client.py:319:join_sync] 1:worker join sync relauch_for_ps
[2023-03-16 15:39:23,754] [INFO][estimator_util.py:41:after_run] Before stopping training thread, worker should wait for cheif to save checkpoint
[2023-03-16 15:39:34,768] [INFO][estimator_util.py:49:after_run] Training thread stopped because chief had saved checkpoint
[2023-03-16 15:39:34,768] [INFO][global_step_hook.py:42:end] hook end
[2023-03-16 15:39:34,885] [INFO][estimator.py:371:train] Loss for final step: 0.875.
[2023-03-16 15:39:34,886] [INFO][tf_kubernetes_worker.py:77:run] ps is migrating or scaling
[2023-03-16 15:39:34,886] [INFO][tf_kubernetes_worker.py:42:init_executor] init_executor
[2023-03-16 15:39:34,886] [INFO][tensorflow_failover.py:41:__init__] initiating tensorflow_failover and failover level is 1
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/dlrover/trainer/entry/local_entry.py", line 27, in <module>
starter.run()
File "/usr/local/lib/python3.8/dist-packages/dlrover/trainer/platform/starter.py", line 94, in run
return execute(args)
File "/usr/local/lib/python3.8/dist-packages/dlrover/trainer/platform/starter.py", line 85, in execute
return worker.run()
File "/usr/local/lib/python3.8/dist-packages/dlrover/trainer/worker/tf_kubernetes_worker.py", line 62, in run
self.start_failover_monitor()
File "/usr/local/lib/python3.8/dist-packages/dlrover/trainer/worker/tf_kubernetes_worker.py", line 48, in start_failover_monitor
self.tensorflow_failover = TensorflowFailover()
File "/usr/local/lib/python3.8/dist-packages/dlrover/trainer/tensorflow/failover/tensorflow_failover.py", line 48, in __init__
self.init_for_dynet()
File "/usr/local/lib/python3.8/dist-packages/dlrover/trainer/tensorflow/failover/tensorflow_failover.py", line 56, in init_for_dynet
self._address = TF_CONFIG["cluster"][task_type][task_id]
IndexError: list index out of range
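The IndexError above occurs because the refreshed cluster spec no longer contains an entry for this task's index (here, worker index 1 after scaling). A defensive lookup (a sketch of the idea, not the actual DLRover fix) validates the index before use so the worker can wait for a new TF_CONFIG instead of crashing:

```python
import json


def get_task_address(tf_config_str, task_type, task_id):
    """Return the address of (task_type, task_id) from a TF_CONFIG JSON
    string, or None if the cluster spec no longer contains that task."""
    tf_config = json.loads(tf_config_str)
    tasks = tf_config.get("cluster", {}).get(task_type, [])
    if task_id >= len(tasks):
        # The task was removed during migration/scaling; the caller
        # should wait for a refreshed TF_CONFIG instead of indexing.
        return None
    return tasks[task_id]


# A cluster spec that lists only worker 0, queried for worker 1:
config = json.dumps({"cluster": {"worker": ["worker-0:2222"]},
                     "task": {"type": "worker", "index": 1}})
print(get_task_address(config, "worker", 1))  # None instead of IndexError
```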
Generate the informer, client, and lister for the CRDs ElasticJob and Scaler.