Comments (9)

bleachzk commented on May 18, 2024

LCM logs:

{"level":"info","msg":"transport: http2Server.HandleStreams failed to read frame: read tcp [::1]:8443-\u003e[::1]:53622: read: connection reset by peer","time":"2018-06-20T08:56:58Z"}

from ffdl.

Tomcli commented on May 18, 2024

Hi @bleachzk, can I have some details about your job? (e.g. $CLI_CMD show training-gSR-qONmR). If you are requesting GPUs for your training job, do you have any GPU resources available on your Kubernetes Cluster?

bleachzk commented on May 18, 2024

Model definition:
  Name: tf_convolutional_network_tutorial
  Description: Convolutional network model using tensorflow
  Framework: tensorflow:1.7.0-gpu-py3
Training:
  Status: PENDING
  Submitted: N/A
  Completed: N/A
  Resources: 2.00 CPUs | 1.00 GPUs | 2.00 GB Mem | 1 node(s)
  Command: python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz
    --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz
    --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001
    --trainingIters 20000
  Input data : sl-internal-os-input
  Output data: sl-internal-os-output
Data stores:
  ID: sl-internal-os-input
  Type: s3_datastore
  Connection:
    auth_url: http://s3.default.svc.cluster.local
    bucket: tf_training_data
    password: test
    type: s3_datastore
    user_name: test
  ID: sl-internal-os-output
  Type: s3_datastore
  Connection:
    auth_url: http://s3.default.svc.cluster.local
    bucket: tf_trained_model
    password: test
    type: s3_datastore
    user_name: test
Summary metrics:
OK

bleachzk commented on May 18, 2024

@Tomcli I have set up nvidia-device-plugin.

Tomcli commented on May 18, 2024

Hi @bleachzk, did you deploy ffdl-lcm with the device-plugin tag (e.g. helm install --set lcm.version="device-plugin" .)? I ask because ffdl-lcm:latest uses accelerators for GPU resources.

After you switch ffdl-lcm to the device-plugin tag, all new GPU jobs should consume nvidia.com/gpu resources.

As accelerators are deprecated in K8s 1.10, we will add a new pre-0.1 patch to FfDL this week that uses device-plugin by default.
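
Before resubmitting a GPU job, it can help to confirm that the nodes actually advertise nvidia.com/gpu. The following is only a sketch: total_gpus is a hypothetical helper name, and it assumes a two-column listing (node name, allocatable GPU count) such as the one the kubectl command in the comment prints.

```shell
# Hypothetical helper: sum the GPU column of a two-column node listing.
# Reads the listing on stdin, skips the header row, and ignores rows whose
# GPU column is not a plain number (kubectl prints <none> for such nodes).
total_gpus() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ { sum += $2 } END { print sum + 0 }'
}

# Assumed usage on a machine with cluster access:
#   kubectl get nodes \
#     "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" \
#     | total_gpus
```

A result of 0 would suggest that the device plugin is not running on any node, in which case a job requesting GPUs could stay PENDING.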

bleachzk commented on May 18, 2024

@Tomcli after upgrading to v0.1, the learner pod fails to start with this error:

MountVolume.SetUp failed for volume "cosoutputmount-107082b8-77ec-4686-61f6-b87d630babfb" : mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed: s3fs: error while loading shared libraries: libfuse.so.2: cannot open shared object file: No such file or directory
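
One way to confirm which shared libraries the s3fs binary cannot resolve on a node is to filter the output of ldd. This is a sketch rather than part of FfDL: missing_libs is a hypothetical helper name, and the usage line assumes s3fs is on the PATH of the worker node.

```shell
# Hypothetical helper: print the names of libraries that ldd reports as
# "not found". Reads ldd output on stdin, so saved output works as well.
missing_libs() {
  awk '/=> not found/ { print $1 }'
}

# Assumed usage on a worker node:
#   ldd "$(command -v s3fs)" | missing_libs
```

If libfuse.so.2 shows up in the result, the FUSE runtime library is not installed on that node (or is not on the loader's search path), which matches the mount error above.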

Tomcli commented on May 18, 2024

Hi @bleachzk, with our new v0.1 release, we require users to install the s3fs drivers on each of their nodes (e.g. by using the storage-plugin helm chart or by following the ibmcloud-object-storage-plugin instructions).

Since the s3fs installation varies between Kubernetes environments, I can point you to more specific instructions if you let me know what kind of Kubernetes environment you are using.

Thanks.

bleachzk commented on May 18, 2024

System version: CentOS 7.2 3.10.0-514.26.2.el7.x86_64
Kubernetes version: 1.10
Docker version: CE 18.03
@Tomcli

Tomcli commented on May 18, 2024

@bleachzk For your cluster, I think you need to install the s3fs drivers and copy the driver binary ibmc-s3fs onto each of your worker nodes:

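sudo apt-get install s3fs   # Debian/Ubuntu; CentOS has no apt-get, so install the s3fs-fuse package from EPEL with yum instead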
sudo mkdir -p /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
sudo cp <FfDL repo>/bin/ibmc-s3fs /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
sudo chmod +x /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs/ibmc-s3fs
sudo systemctl restart kubelet

Then, install the storage-plugin helm chart if you haven't already:

helm install storage-plugin --set cloud=false

Then your learner pods should be able to mount any S3 Object Storage.
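
To check a node after the copy steps above, a small test for the driver binary can be useful. This is a minimal sketch: check_flexvolume_driver is a hypothetical helper name, and the default directory is simply the path used in the commands above.

```shell
# Hypothetical helper: report whether the ibmc-s3fs driver binary exists
# and is executable in the given directory (defaults to the path used above).
check_flexvolume_driver() {
  dir="${1:-/usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs}"
  if [ -x "$dir/ibmc-s3fs" ]; then
    echo "ok"
  else
    echo "missing"
    return 1
  fi
}

# Assumed usage on each worker node:
#   check_flexvolume_driver
```

The check only covers the file and its execute bit. It does not verify that kubelet has picked up the driver or that s3fs itself can load its libraries.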