The edl from elasticdeeplearning

Test EDL's functionality with CRD

The test includes:

Build images with the Dockerfile;
Push the images to Docker Hub;
Pull the images built above and run the EDL controller in a k8s cluster;
Run the training job in the k8s cluster to test ASGD.

[bug] redis client fails on timeout

Fault tolerance for timeout failures needs to be added.
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/root/paddlejob/workspace/env_run/python-2.7.14/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/root/paddlejob/workspace/env_run/python-2.7.14/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/root/paddlejob/workspace/env_run/python-2.7.14/lib/python2.7/site-packages/paddle_edl/distill/redis/client.py", line 95, in _heartbeat
    msg = self._recv_msg()
  File "/root/paddlejob/workspace/env_run/python-2.7.14/lib/python2.7/site-packages/paddle_edl/distill/redis/client.py", line 55, in _recv_msg
    head = self.client.recv(self.HEAD_SIZE)
error: [Errno 110] Connection timed out
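
One possible shape for the requested fault tolerance is to retry the receive call when it times out, rather than letting the heartbeat thread die. The sketch below is only an illustration: recv_with_retry, its parameters, and the retry policy are hypothetical and not part of paddle_edl, and it is written for Python 3 although the traceback above comes from Python 2.7.

```python
import errno
import socket
import time


def recv_with_retry(sock, size, retries=3, delay=0.0):
    """Call sock.recv(size), retrying when the call times out.

    Hypothetical helper, not part of paddle_edl. Returns the bytes read,
    or None if every attempt timed out. Other socket errors propagate.
    """
    for attempt in range(retries):
        try:
            return sock.recv(size)
        except socket.timeout:
            pass  # recv exceeded the timeout configured on the socket
        except OSError as exc:
            # Errno 110 (ETIMEDOUT) is the error shown in the traceback.
            if exc.errno != errno.ETIMEDOUT:
                raise
        if attempt + 1 < retries:
            time.sleep(delay)  # wait before the next attempt
    return None
```

A caller such as a heartbeat loop could treat a None result as "server unreachable" and reconnect or back off instead of raising out of the thread.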

IP address and Port concat error

I tried to run distributed CTR training on a K8S cluster. There is a mechanism that collects the IP and port of every pserver, but the concatenation of IP and port looks wrong.

Initially it is 172.20.1.69:30236, 172.20.1.70:30237, which are the addresses of pserver 1 and pserver 2.
But then it displays 172.20.1.69:30236, 172.20.1.70.30237:30236, so pserver 2 ends up with two ports.
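
The cause of the bad concatenation is not shown in this report. As a rough illustration only, the sketch below joins each IP with exactly one port and includes a small check that would flag an entry like 172.20.1.70.30237:30236. Both build_endpoints and is_valid_endpoint are hypothetical names, not functions from EDL or PaddlePaddle.

```python
def build_endpoints(ips, ports):
    """Join each IP with its own port into a comma-separated endpoint list."""
    if len(ips) != len(ports):
        raise ValueError("need exactly one port per IP")
    return ",".join("{}:{}".format(ip, port) for ip, port in zip(ips, ports))


def is_valid_endpoint(endpoint):
    """Return True if endpoint looks like a dotted IPv4 address plus one port."""
    host, sep, port = endpoint.rpartition(":")
    if not sep or not port.isdigit():
        return False
    parts = host.split(".")
    # An IPv4 host has exactly four numeric parts, each in 0..255.
    return len(parts) == 4 and all(
        p.isdigit() and 0 <= int(p) <= 255 for p in parts
    )
```

Running is_valid_endpoint over the collected list before starting the trainers would catch the malformed second entry early.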

Transfer the edl repo to its own GitHub organization

Please complete the transfer of edl repo to its own GitHub org located in: https://github.com/elasticdeeplearning/.

Here are the steps:

1. Save the README.md file from https://github.com/elasticdeeplearning/edl to use later in step 4, because it has been updated a little.
2. Delete the repo https://github.com/elasticdeeplearning/edl (this was created as a copied repo and not as a transfer).
3. Follow these instructions: https://help.github.jp/enterprise/2.11/user/articles/transferring-a-repository-owned-by-your-personal-account/ under "Transferring a repository to another user account or to an organization". This will transfer https://github.com/PaddlePaddle/edl to https://github.com/elasticdeeplearning/edl.
4. Copy the saved README.md (see step 1 above) into the newly transferred repo (https://github.com/elasticdeeplearning/edl) so we don't have to update it again.
5. https://github.com/PaddlePaddle/edl should not exist anymore. When you try to go to it, it will automatically redirect to https://github.com/elasticdeeplearning/edl.

Thank you.
Ibrahim

Could the README be provided in Chinese?

Need an EDL logo embedded in README.md

Trouble Running Resnet + Imagenet Demo

Hello, I've been trying to run the resnet + imagenet demo shown here https://github.com/elasticdeeplearning/edl/tree/develop/example/demo/collective for several days now but with no success. I've tried doing this on my local machine by pip installing paddle_edl into a conda environment and all associated requirements in addition to trying with the recommended docker image:

docker pull hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7
nvidia-docker run --name paddle_edl hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7 /bin/bash

I suspect that this demo is outdated, and I was wondering if it could be updated or explained in more detail so that I can get it working. If the demo does in fact still work, can someone run it using the expected docker image and let me know the exact steps to replicate it?

I've tried running the demo using several different combinations of steps but here is what I'm doing in general.

Reproduction Steps:

Enter the recommended docker image and mount my imagenet dataset.
Enter edl/example/demo/collective
Set PADDLE_EDL_IMAGENET_PATH, PADDLE_EDL_FLEET_CHECKPOINT_PATH and PADDLE_JOBSERVER
Run ./start_job_server.sh
Run ./start_job_client.sh
Find failures in either the pod logs, the worker log in each pod directory or client/server logs

Some issues that I have faced so far:

I don't know what the specifications are for train.txt, test.txt or val.txt for the imagenet dataset, and I get errors when using mine with the edl demo. What preprocessing does edl expect for its imagenet dataset, and how is the dataset structured, so that I can use my own?
This line (edl/example/demo/collective/resnet50/package.sh, line 33 in dbe38fb):

src_dir=../../../collective/resnet50

should be changed to src_dir=../../collective/resnet50. Some directories must have been moved around, as this is not the correct path to the resnet files.
All but one of the created pods manage to establish a connection to their desired endpoint. All the failed pods output a message such as:

not ready endpoints:['127.0.0.1:8073', '127.0.0.1:8075', '127.0.0.1:8077', '127.0.0.1:8079', '127.0.0.1:8081', '127.0.0.1:8083', '127.0.0.1:8085']
server not ready, wait 3 sec to retry...

nohup python -u paddle_edl.demo.collective.job_server_demo should use the -m flag instead of -u, as paddle_edl cannot be found otherwise (edl/example/demo/collective/start_job_server.sh, line 26 in dbe38fb):

nohup python -u paddle_edl.demo.collective.job_server_demo \

The same applies to the client bash file (edl/example/demo/collective/start_job_client.sh, line 33 in dbe38fb):

nohup python -u paddle_edl.demo.collective.job_client_demo \

These are a subset of the total issues I've had to debug to get this far.
I've also tried running the MNIST tutorial with no luck as well (https://github.com/elasticdeeplearning/edl/blob/develop/doc/boss_tutorial.md)
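
When debugging the "not ready endpoints" message above, it can help to check independently which of the listed endpoints accept TCP connections at all. The following is a minimal diagnostic sketch using only the Python standard library; split_ready is a hypothetical helper and says nothing about how EDL itself performs this check.

```python
import socket


def split_ready(endpoints, timeout=1.0):
    """Split "host:port" strings into (ready, not_ready) lists.

    An endpoint counts as ready if a TCP connection to it succeeds
    within the given timeout.
    """
    ready, not_ready = [], []
    for endpoint in endpoints:
        host, _, port = endpoint.rpartition(":")
        try:
            with socket.create_connection((host, int(port)), timeout=timeout):
                ready.append(endpoint)
        except OSError:
            # Covers refused connections, timeouts and unreachable hosts.
            not_ready.append(endpoint)
    return ready, not_ready
```

For example, calling split_ready(['127.0.0.1:8073', '127.0.0.1:8075']) inside a pod shows whether anything is listening on those ports from that pod's point of view.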

System information:

PaddlePaddle version: I have tried with v1.8.5 locally and whatever version is packaged into the docker image
EDL version: I have tried with v0.3.1 locally and whatever version is packaged into the docker image
GPU: Tesla M60 with CUDA 9.0 and CUDNN 7.0
OS Platform: Ubuntu 16.04.6 LTS

Thanks and looking forward to demoing the project!

Add an EDL tutorial

We need an EDL tutorial to introduce how to use EDL on a Kubernetes cluster.

Fix unit tests under Python 3

The following unit tests need to be fixed under Python 3:

test_distill_reader.sh
test_redis_distill_reader.sh

[Question] Does edl rely on the PaddlePaddle elastic learning capability

To my knowledge, edl should rely on the capability of the PaddlePaddle workload. In other words, PaddlePaddle must have the ability to do elastic learning.

If my understanding is correct, could you please direct me where I can find the design doc of elastic learning of PaddlePaddle workload? I can understand a bit of edl itself from the source code, but I cannot find anything about the elastic PaddlePaddle workload.

Hyperlinks to our documents are broken

Reported by @Haichao-Zhang :

The hyperlinks to our EDL documents in this post on the Baidu Research Blog are broken.

Could somebody fix this by writing a diff from the current blog post content to the corrected one? I can then ask the administrator of the Baidu Research Blog to correct their content.

Thanks!

[Question] k8s native edl should rely on the capability of PaddlePaddle workload.

To my knowledge, edl should rely on the capability of the PaddlePaddle workload. In other words, PaddlePaddle must have the ability to do elastic learning.

If my understanding is correct, could you please direct me where I can find the design doc of elastic learning of PaddlePaddle workload? I can understand a bit of edl itself from the source code, but I cannot find anything about the elastic PaddlePaddle workload.

[request] Please update the links in README

I found that some links in the README return 404, and I think we should update them. For example, I am looking for the docs on Fault-Tolerant Training in PaddlePaddle, but I cannot find them.

I'd appreciate it if anyone could help me.

Thanks 🥂 🍻

[Question] Can I set different min-instance and max-instance values for pserver?

The example shows the same min-instance and max-instance values for pserver.

pserver:
  min-instance: 2
  max-instance: 2
  resources:
    ...

Can I set them to different values? Thanks.

Add "how to run on MiniKube"

The README.md file contains only "How to Build". We also need "How to Run".

Need group scheduling of the pods in a single operator to prevent starvation

We need to group-schedule the pods in a single operator to prevent resource starvation. I am investigating kube-arbitrator at the moment and have opened this issue to keep track of it.

Run Fluid with EDL

Tasks

Write doc to demonstrate edl function with crd

Refactor the implementation of single job lifecycle control

[Question] Does the edl rely on the capability of PaddlePaddle workload

To my knowledge, edl should rely on the capability of the PaddlePaddle workload. In other words, PaddlePaddle must have the ability to do elastic learning.

If my understanding is correct, could you please direct me where I can find the design doc of elastic learning of PaddlePaddle workload? I can understand a bit of edl itself from the source code, but I cannot find anything about the elastic PaddlePaddle workload.

Add feature that can pass user defined values in TrainingJob to actual pod specs

For example, sometimes we need to specify imagePullSecret or hostNetwork field for all of the pods, including pservers and trainers.
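
A rough sketch of what such a pass-through could look like, assuming pod specs are plain dictionaries: user-defined top-level fields are copied onto every pod spec. The EDL controller itself is written in Go, so this Python version only illustrates the idea. apply_pod_overrides and the example values are hypothetical; hostNetwork and imagePullSecrets are the Kubernetes PodSpec field names.

```python
import copy


def apply_pod_overrides(pod_spec, overrides):
    """Return a copy of pod_spec with user-defined top-level fields applied."""
    merged = copy.deepcopy(pod_spec)
    for key, value in overrides.items():
        merged[key] = copy.deepcopy(value)  # user-defined value wins
    return merged


# Hypothetical user-defined values taken from a TrainingJob.
overrides = {
    "hostNetwork": True,
    "imagePullSecrets": [{"name": "my-registry-secret"}],
}

# Hypothetical base specs generated for each role.
base_specs = {
    "pserver": {"containers": [{"name": "pserver", "image": "example/image"}]},
    "trainer": {"containers": [{"name": "trainer", "image": "example/image"}]},
}

# Apply the same overrides to pservers and trainers alike.
pod_specs = {
    role: apply_pod_overrides(spec, overrides)
    for role, spec in base_specs.items()
}
```

Copying instead of mutating keeps the generated base specs reusable if the overrides change between reconciliations.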

Roadmap for supporting different frameworks

Write a design doc for generic Python API tools that enable fault tolerance. Developers can insert a few lines of code into their training program to enable fault-tolerant training. -- 2 weeks, with a discussion
Implement this generic Python API for at least one framework. -- 2 weeks
Polish the CRD implementation and run tests on real clusters. -- 2 weeks
Implement and test for more frameworks: TensorFlow, Keras, Caffe2. -- 4 weeks
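
To make the first roadmap item more concrete, here is one hypothetical illustration of "inserting a few lines" into a training loop: the loop loads a checkpoint at startup and saves one after every step, so a restarted process resumes where it left off. The Checkpoint class, the train function, and the JSON file format are all assumptions made for this sketch; the design doc mentioned above may look entirely different.

```python
import json
import os


class Checkpoint:
    """Hypothetical file-backed checkpoint; not an EDL API."""

    def __init__(self, path):
        self.path = path

    def load(self, default):
        """Return the saved state, or a copy of default if none exists."""
        if not os.path.exists(self.path):
            return dict(default)
        with open(self.path) as f:
            return json.load(f)

    def save(self, state):
        """Write state atomically so a crash never leaves a partial file."""
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)


def train(ckpt, total_steps, fail_at=None):
    """Toy training loop; fail_at simulates a crash at a given step."""
    state = ckpt.load({"step": 0, "loss_sum": 0.0})  # inserted line
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["loss_sum"] += 1.0  # stand-in for one real training step
        state["step"] += 1
        ckpt.save(state)  # inserted line
    return state
```

Calling train again with the same checkpoint path after a failure continues from the last saved step instead of step 0.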

[bug] redis balance service thread dies

EDL controller does not submit created resources currently

We intend to submit ReplicaSets and a Job for each TrainingJob, but this is not implemented yet.
See the code here: https://github.com/PaddlePaddle/edl/blob/develop/pkg/controller.go#L133
