Comments (9)
LCM logs:
{"level":"info","msg":"transport: http2Server.HandleStreams failed to read frame: read tcp [::1]:8443-\u003e[::1]:53622: read: connection reset by peer","time":"2018-06-20T08:56:58Z"}
from ffdl.
Hi @bleachzk, can I have some details about your job (e.g. the output of $CLI_CMD show training-gSR-qONmR)? If you are requesting GPUs for your training job, do you have any GPU resources available on your Kubernetes cluster?
from ffdl.
Model definition:
  Name: tf_convolutional_network_tutorial
  Description: Convolutional network model using tensorflow
  Framework: tensorflow:1.7.0-gpu-py3
Training:
  Status: PENDING
  Submitted: N/A
  Completed: N/A
  Resources: 2.00 CPUs | 1.00 GPUs | 2.00 GB Mem | 1 node(s)
  Command: python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz
    --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz
    --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001
    --trainingIters 20000
  Input data : sl-internal-os-input
  Output data: sl-internal-os-output
Data stores:
  ID: sl-internal-os-input
  Type: s3_datastore
  Connection:
    auth_url: http://s3.default.svc.cluster.local
    bucket: tf_training_data
    password: test
    type: s3_datastore
    user_name: test
  ID: sl-internal-os-output
  Type: s3_datastore
  Connection:
    auth_url: http://s3.default.svc.cluster.local
    bucket: tf_trained_model
    password: test
    type: s3_datastore
    user_name: test
Summary metrics:
OK
from ffdl.
@Tomcli I have set up the nvidia-device-plugin.
from ffdl.
Hi @bleachzk, did you deploy ffdl-lcm with the device-plugin tag (e.g. helm install --set lcm.version="device-plugin" .)? I ask because ffdl-lcm:latest uses accelerators for GPU resources.
After you switch ffdl-lcm to the device-plugin tag, all new GPU jobs should consume nvidia.com/gpu resources.
Since accelerators are deprecated in K8s 1.10, we will add a new pre-0.1 patch to FfDL this week to use device-plugin as the default.
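Before resubmitting a GPU job, it may help to confirm that the nodes actually advertise nvidia.com/gpu. The following is a minimal sketch, assuming kubectl is already configured for the cluster; nodes_without_gpu and check_cluster_gpus are hypothetical helper names, not part of FfDL.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of FfDL).

# Reads "NODE GPU" rows on stdin (header on the first line) and prints the
# names of nodes that advertise no nvidia.com/gpu resource.
nodes_without_gpu() {
  awk 'NR > 1 && ($2 == "<none>" || $2 == "0") { print $1 }'
}

# Lists each node's allocatable nvidia.com/gpu count and pipes it through
# the filter above. Assumes kubectl already points at the right cluster.
check_cluster_gpus() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' |
    nodes_without_gpu
}
```

Running check_cluster_gpus prints one node name per line for every node without allocatable GPUs. If every node is listed, a job requesting 1 GPU would be expected to stay PENDING.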
from ffdl.
@Tomcli after upgrading to v0.1, the learner pod fails to start with this error:
MountVolume.SetUp failed for volume "cosoutputmount-107082b8-77ec-4686-61f6-b87d630babfb" : mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed: s3fs: error while loading shared libraries: libfuse.so.2: cannot open shared object file: No such file or directory
from ffdl.
Hi @bleachzk, with our new v0.1 release, we require users to install the s3fs drivers on each of their nodes (e.g. by using the storage-plugin helm chart or by following the ibmcloud-object-storage-plugin instructions).
Since the s3fs installation varies across Kubernetes environments, I can point you to more specific instructions if you let me know what kind of Kubernetes environment you are using.
Thanks.
from ffdl.
System version: CentOS 7.2 3.10.0-514.26.2.el7.x86_64
Kubernetes version: 1.10
Docker version: CE 18.03
@Tomcli
from ffdl.
@bleachzk For your cluster, I think you need to install the s3fs drivers and copy the driver binary ibmc-s3fs onto each of your worker nodes.
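# Install s3fs. apt-get applies to Debian/Ubuntu; on CentOS, s3fs is typically
# packaged as s3fs-fuse in EPEL (sudo yum install epel-release, then sudo yum install s3fs-fuse).
sudo apt-get install s3fs
# Copy the ibmc-s3fs driver into the kubelet volume plugin directory and make it executable.
sudo mkdir -p /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
sudo cp <FfDL repo>/bin/ibmc-s3fs /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
sudo chmod +x /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs/ibmc-s3fs
# Restart kubelet so it picks up the new driver.
sudo systemctl restart kubelet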
Then install the storage-plugin helm chart if you haven't done so already.
helm install storage-plugin --set cloud=false
Your learner pods should then be able to mount any S3 object storage.
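To check a worker node after following these steps, a small script along these lines may be useful. It is only a sketch: check_s3fs_node is a hypothetical helper, not part of FfDL, and it assumes ldconfig is available to look up libfuse.so.2 (the library named in the mount error earlier in this thread).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of FfDL): reports whether the pieces
# installed on a worker node for s3fs mounts are present.

# Driver directory from the install steps; override via the environment if needed.
DRIVER_DIR="${DRIVER_DIR:-/usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs}"

check_s3fs_node() {
  local status=0

  # The s3fs binary itself must be on PATH.
  if command -v s3fs >/dev/null 2>&1; then
    echo "ok: s3fs found"
  else
    echo "missing: s3fs binary"; status=1
  fi

  # s3fs needs libfuse.so.2 at runtime.
  if ldconfig -p 2>/dev/null | grep -q 'libfuse\.so\.2'; then
    echo "ok: libfuse.so.2 found"
  else
    echo "missing: libfuse.so.2"; status=1
  fi

  # The ibmc-s3fs driver binary must exist and be executable.
  if [ -x "$DRIVER_DIR/ibmc-s3fs" ]; then
    echo "ok: ibmc-s3fs driver is executable"
  else
    echo "missing: $DRIVER_DIR/ibmc-s3fs"; status=1
  fi

  return "$status"
}
```

Calling check_s3fs_node on each worker node prints one line per check and returns a non-zero status if anything is missing.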
from ffdl.