Training status is PENDING not change about ffdl HOT 9 OPEN

ibm commented on May 18, 2024

The training status stays at PENDING and never changes.

from ffdl.

Comments (9)

bleachzk commented on May 18, 2024

LCM logs:

{"level":"info","msg":"transport: http2Server.HandleStreams failed to read frame: read tcp [::1]:8443-\u003e[::1]:53622: read: connection reset by peer","time":"2018-06-20T08:56:58Z"}

Tomcli commented on May 18, 2024

Hi @bleachzk, can I have some details about your job? (e.g. $CLI_CMD show training-gSR-qONmR). If you are requesting GPUs for your training job, do you have any GPU resources available on your Kubernetes Cluster?

bleachzk commented on May 18, 2024

Model definition:
  Name: tf_convolutional_network_tutorial
  Description: Convolutional network model using tensorflow
  Framework: tensorflow:1.7.0-gpu-py3
Training:
  Status: PENDING
  Submitted: N/A
  Completed: N/A
  Resources: 2.00 CPUs | 1.00 GPUs | 2.00 GB Mem | 1 node(s)
  Command: python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz
    --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz
    --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001
    --trainingIters 20000
  Input data : sl-internal-os-input
  Output data: sl-internal-os-output
Data stores:
  ID: sl-internal-os-input
  Type: s3_datastore
  Connection:
    auth_url: http://s3.default.svc.cluster.local
    bucket: tf_training_data
    password: test
    type: s3_datastore
    user_name: test
  ID: sl-internal-os-output
  Type: s3_datastore
  Connection:
    auth_url: http://s3.default.svc.cluster.local
    bucket: tf_trained_model
    password: test
    type: s3_datastore
    user_name: test
Summary metrics:
OK

bleachzk commented on May 18, 2024

@Tomcli I have set up the nvidia-device-plugin.

Tomcli commented on May 18, 2024

Hi @bleachzk, did you deploy ffdl-lcm with the device-plugin tag (e.g. helm install --set lcm.version="device-plugin" .)? I ask because ffdl-lcm:latest uses accelerators for GPU resources.

After you switch ffdl-lcm to the device-plugin tag, all new GPU jobs should consume nvidia.com/gpu resources.

Since accelerators are deprecated in K8s 1.10, we will add a new pre-0.1 patch to FfDL this week that uses device-plugin by default.
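
A GPU job can only leave PENDING if at least one node advertises the nvidia.com/gpu resource. The following is a minimal sketch for checking that, assuming kubectl is configured for the cluster. The function names list_gpu_capacity and gpu_nodes are made up for illustration and are not part of FfDL.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of FfDL.

# List each node with its allocatable nvidia.com/gpu count.
# Nodes that do not advertise the resource show "<none>".
list_gpu_capacity() {
    kubectl get nodes \
        -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Read that table on stdin and print only the nodes with at least one GPU.
gpu_nodes() {
    awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

# Usage (needs cluster access):
#   list_gpu_capacity | gpu_nodes
```

If gpu_nodes prints nothing, no node is offering nvidia.com/gpu, and the device plugin deployment is the first thing to look at.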

bleachzk commented on May 18, 2024

@Tomcli after upgrading to v0.1, the learner pod fails to start with this error:

MountVolume.SetUp failed for volume "cosoutputmount-107082b8-77ec-4686-61f6-b87d630babfb" : mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed: s3fs: error while loading shared libraries: libfuse.so.2: cannot open shared object file: No such file or directory

Tomcli commented on May 18, 2024

Hi @bleachzk, with our new v0.1 release, we require users to install the s3fs drivers on each of their nodes (e.g. by using the storage-plugin helm chart or by following the ibmcloud-object-storage-plugin instructions).

Since the s3fs installation varies between Kubernetes environments, I can point you to more specific instructions if you let me know what kind of Kubernetes environment you are using.

Thanks.

bleachzk commented on May 18, 2024

System version: CentOS 7.2 3.10.0-514.26.2.el7.x86_64
Kubernetes version: 1.10
Docker version: CE 18.03
@Tomcli

Tomcli commented on May 18, 2024

@bleachzk For your cluster, I think you need to install the s3fs drivers and copy the driver binary ibmc-s3fs onto each of your worker nodes.

sudo apt-get install s3fs
sudo mkdir -p /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
sudo cp <FfDL repo>/bin/ibmc-s3fs /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
sudo chmod +x /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs/ibmc-s3fs
sudo systemctl restart kubelet

Then, install the storage-plugin helm chart if you haven't already.

helm install storage-plugin --set cloud=false

Then your learner pods should be able to mount any S3 Object Storage.
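
To confirm that a worker node is ready before resubmitting a job, something like the sketch below may help. It checks that the ibmc-s3fs driver is executable at the path used above and that the s3fs binary has no unresolved shared libraries, which is what the libfuse.so.2 error earlier in this thread points to. The function names missing_libs and check_node are hypothetical, and the check assumes ldd is available on the node.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of FfDL.
PLUGIN_DIR='/usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs'

# Read ldd output on stdin and print the libraries reported as "not found".
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Return non-zero if the driver or any s3fs dependency is missing on this node.
check_node() {
    status=0
    if [ ! -x "$PLUGIN_DIR/ibmc-s3fs" ]; then
        echo "driver missing or not executable: $PLUGIN_DIR/ibmc-s3fs"
        status=1
    fi
    s3fs_bin=$(command -v s3fs) || { echo "s3fs not found on PATH"; return 1; }
    missing=$(ldd "$s3fs_bin" | missing_libs)
    if [ -n "$missing" ]; then
        echo "s3fs is missing libraries: $missing"
        status=1
    fi
    return "$status"
}

# Usage (run on each worker node):
#   check_node && echo "node looks ready"
```

A report of libfuse.so.2 from check_node would mean the FUSE library still needs to be installed on that node.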
