tfmesos's Introduction

TFMesos

Join the chat at https://gitter.im/douban/tfmesos

TFMesos is a lightweight framework for running distributed TensorFlow machine learning tasks on Apache Mesos within Docker and Nvidia-Docker.

TFMesos dynamically allocates resources from a Mesos cluster, builds a distributed training cluster for TensorFlow, and keeps different training tasks managed and isolated within the shared Mesos cluster with the help of Docker.

Prerequisites

  • For Mesos >= 1.0.0:
  1. Mesos Cluster (cf: Mesos Getting Started). All nodes in the cluster should be reachable using their hostnames, and all nodes have identical /etc/passwd and /etc/group.
  2. Set up the Mesos agent to enable the Mesos containerizer and (optionally) Mesos Nvidia GPU support, e.g.: mesos-agent --containerizers=mesos --image_providers=docker --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia
  3. (optional) A Distributed Filesystem (eg: MooseFS)
  4. Ensure the latest TFMesos Docker image (tfmesos/tfmesos) is pulled on every node in the cluster
  • For Mesos < 1.0.0:
  1. Mesos Cluster (cf: Mesos Getting Started). All nodes in the cluster should be reachable using their hostnames, and all nodes have identical /etc/passwd and /etc/group.
  2. Docker (cf: Docker Getting Started Tutorial)
  3. Mesos Docker Containerizer Support (cf: Mesos Docker Containerizer)
  4. (optional) Nvidia-docker installation (cf: Nvidia-docker installation); make sure nvidia-docker-plugin is accessible from remote hosts (launched with -l 0.0.0.0:3476)
  5. (optional) A Distributed Filesystem (eg: MooseFS)
  6. Ensure the latest TFMesos Docker image (tfmesos/tfmesos) is pulled on every node in the cluster

If you are using an AWS G2 instance, here is a sample script to set up most of these prerequisites.

Running a simple test

After setting up Mesos and pulling the Docker image on a single node (or a cluster), you should be able to run a simple test with the following command.

$ docker run -e MESOS_MASTER=mesos-master:5050 \
    -e DOCKER_IMAGE=tfmesos/tfmesos \
    --net=host \
    -v /path-to-your-tfmesos-code/tfmesos/examples/plus.py:/tmp/plus.py \
    --rm \
    -it \
    tfmesos/tfmesos \
    python /tmp/plus.py mesos-master:5050

Successfully running the test should result in an output of 42 on the console.
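For orientation, below is a rough reconstruction of what plus.py does, pieced together from the tracebacks quoted in the issues further down this page. The jobs_def structure and the exact graph are simplified guesses, so treat this as a sketch rather than the actual source (which lives at tfmesos/examples/plus.py, mounted to /tmp/plus.py above).

# Simplified, reconstructed sketch of plus.py (not the exact source).
# The jobs_def structure below is a guess for illustration only.
import sys
import tensorflow as tf
from tfmesos import cluster


def main(argv):
    mesos_master = argv[1]
    jobs_def = [
        dict(name='ps', num=2),       # assumed format; see examples/plus.py
        dict(name='worker', num=2),
    ]
    with cluster(jobs_def, master=mesos_master, quiet=False) as targets:
        # 'targets' maps TensorFlow device names to gRPC targets of the
        # remote servers started on the Mesos cluster.
        with tf.device('/job:ps/task:0'):
            a = tf.constant(40)
        with tf.device('/job:worker/task:0'):
            b = tf.constant(2)
            op = a + b
        with tf.Session(targets['/job:worker/task:0']) as sess:
            print(sess.run(op))       # expected to print 42


if __name__ == '__main__':
    main(sys.argv)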

Running in replica mode

This mode is called "Between-graph replication" in the official Distributed TensorFlow Howto.

Most of the distributed training models that Google has open sourced (such as mnist_replica and inception) use this mode. In this mode, two kinds of jobs are defined, named 'ps' and 'worker': 'ps' tasks act as parameter servers and 'worker' tasks run the actual training process.
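For orientation, the sketch below shows the standard Distributed TensorFlow skeleton such a script follows, i.e. how the --ps_hosts, --worker_hosts, --job_name and --worker_index flags that tfrun fills in (see the commands below) are typically consumed. It is a generic illustration, not the actual mnist_replica.py.

# Generic between-graph replication skeleton (illustration only).
import argparse
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument('--ps_hosts')        # e.g. "host1:2222"
parser.add_argument('--worker_hosts')    # e.g. "host2:2222,host3:2222"
parser.add_argument('--job_name')        # "ps" or "worker"
parser.add_argument('--worker_index', type=int)
args = parser.parse_args()

cluster_spec = tf.train.ClusterSpec({
    'ps': args.ps_hosts.split(','),
    'worker': args.worker_hosts.split(','),
})
server = tf.train.Server(cluster_spec, job_name=args.job_name,
                         task_index=args.worker_index)

if args.job_name == 'ps':
    server.join()                        # parameter servers just serve variables
else:
    # Each worker builds its own copy of the graph; variables are placed on
    # the parameter servers by replica_device_setter.
    with tf.device(tf.train.replica_device_setter(
            worker_device='/job:worker/task:%d' % args.worker_index,
            cluster=cluster_spec)):
        # ... build model, loss and train_op here ...
        pass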

Here we use our modified 'mnist_replica' as an example:

  1. Check out the mnist example code into a directory on the shared filesystem, e.g. /nfs/mnist
  2. Assume the Mesos master is mesos-master:5050
  3. Now we can launch this script using the following commands:

CPU:

$ docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
             --net=host \
             -v /nfs/mnist:/nfs/mnist \
             -v /etc/passwd:/etc/passwd:ro \
             -v /etc/group:/etc/group:ro \
             -u `id -u` \
             -w /nfs/mnist \
             tfmesos/tfmesos \
             tfrun -w 1 -s 1  \
             -V /nfs/mnist:/nfs/mnist \
             -- python mnist_replica.py \
             --ps_hosts {ps_hosts} --worker_hosts {worker_hosts} \
             --job_name {job_name} --worker_index {task_index}

GPU (1 GPU per worker):

$ nvidia-docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
             --net=host \
             -v /nfs/mnist:/nfs/mnist \
             -v /etc/passwd:/etc/passwd:ro \
             -v /etc/group:/etc/group:ro \
             -u `id -u` \
             -w /nfs/mnist \
             tfmesos/tfmesos \
             tfrun -w 1 -s 1 -Gw 1 -- python mnist_replica.py \
             --ps_hosts {ps_hosts} --worker_hosts {worker_hosts} \
             --job_name {job_name} --worker_index {task_index}

Note:

In this mode, tfrun is used to prepare the cluster and launch the training script on each node, and worker #0 (the chief worker) is launched in the local container. tfrun substitutes {ps_hosts}, {worker_hosts}, {job_name}, and {task_index} with the corresponding values for each task.
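As a purely hypothetical illustration of that substitution (the hostnames, ports and index below are made up), the command template is rendered for each task roughly like this:

# Hypothetical illustration of the per-task substitution tfrun performs;
# hostnames, ports and the task index are made up.
template = ('python mnist_replica.py '
            '--ps_hosts {ps_hosts} --worker_hosts {worker_hosts} '
            '--job_name {job_name} --worker_index {task_index}')

print(template.format(
    ps_hosts='node-a:31000',
    worker_hosts='node-b:31001,node-c:31002',
    job_name='worker',
    task_index=0,    # worker #0, the chief, runs in the local container
))
# -> python mnist_replica.py --ps_hosts node-a:31000
#    --worker_hosts node-b:31001,node-c:31002 --job_name worker --worker_index 0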

Running in fine-grained mode

This mode is called "In-graph replication" in the official Distributed TensorFlow Howto.

In this mode, we have more control over the cluster spec. All nodes in the cluster are remote and just run a gRPC server. Each worker is driven by a local thread that runs the training task.
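A minimal sketch of that driving pattern, assuming 'targets' comes from tfmesos.cluster() (as in plus.py) and 'worker_ops' maps each worker device to a train op already pinned on it; both names are placeholders for illustration:

# Minimal sketch of in-graph replication: one driver graph, one local thread
# per remote worker.
import threading
import tensorflow as tf


def drive_worker(target, op, steps=1000):
    # Each local thread talks to one remote gRPC server.
    with tf.Session(target) as sess:
        for _ in range(steps):
            sess.run(op)


def run_in_graph(targets, worker_ops):
    # targets: device name -> gRPC target, as returned by tfmesos.cluster()
    # worker_ops: device name -> train op pinned to that worker device
    threads = [threading.Thread(target=drive_worker, args=(targets[dev], op))
               for dev, op in worker_ops.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()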

Here we use our modified mnist as an example:

  1. Check out the mnist example code into a directory, e.g. /tmp/mnist
  2. Assume the Mesos master is mesos-master:5050
  3. Now we can launch this script using the following commands:

CPU:

$ docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
             --net=host \
             -v /tmp/mnist:/tmp/mnist \
             -v /etc/passwd:/etc/passwd:ro \
             -v /etc/group:/etc/group:ro \
             -u `id -u` \
             -w /tmp/mnist \
             tfmesos/tfmesos \
             python mnist.py

GPU (1 GPU per worker):

$ nvidia-docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
             --net=host \
             -v /tmp/mnist:/tmp/mnist \
             -v /etc/passwd:/etc/passwd:ro \
             -v /etc/group:/etc/group:ro \
             -u `id -u` \
             -w /tmp/mnist \
             tfmesos/tfmesos \
             python mnist.py --worker-gpus 1

tfmesos's People

Contributors

ariesdevil, gengmao, giorgioercixu, gitter-badger, mckelvin, windreamer, xiaoyongzhu


tfmesos's Issues

More examples needed.

I think we need more examples including cifar10 and full-fledged models such as inception.

Tensorflow 0.11.x & CUDA 8.0 Support

TensorFlow 0.11.0 is now a release candidate, and CUDA 8.0 will be officially supported.
With CUDA 8.0 we expect a performance boost for Nvidia Pascal GPUs, better support for float16, and more.

Maybe we just need to update the Dockerfile and requirements.txt for the new release.

ForcePullImage support?

I found that tasks kicked off via tfmesos won't force-pull the Docker image. I am interested in adding a ForcePullImage option like Marathon and Chronos have; does that sound good? It might need some study though. If you can point me in a direction, that would be very helpful.
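For reference, a hypothetical sketch of what such an option might set on a task's container definition, assuming the Docker containerizer; field names follow the Mesos ContainerInfo.DockerInfo protobuf, and nothing here is existing tfmesos code:

# Hypothetical sketch only, not existing tfmesos code.
def add_force_pull(task_info, image, force_pull=True):
    # Mark the task's docker image to always be re-pulled, mirroring
    # Marathon's forcePullImage.
    task_info['container'] = {
        'type': 'DOCKER',
        'docker': {
            'image': image,
            'force_pull_image': force_pull,
        },
    }
    return task_info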

Driver error when using a load balancer address of mesos masters

I used tfrun to start a training run with 5 workers and 1 parameter server. Whenever the first worker finishes, the driver gets the following traceback in stderr. The framework quits, the ps and all workers except worker-0 get shut down, and then worker-0 just hangs forever.

I tensorflow/core/distributed_runtime/master_session.cc:1012] Start master session f5aa63fb69b3be22 with config: 

[tfrun] Process IO 2017-04-21 10:06:34,367 Failed to process event
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 148, in read
    self._callback.process_event(event)
  File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 226, in process_event
    self.on_event(event)
  File "/usr/local/lib/python2.7/dist-packages/pymesos/scheduler.py", line 491, in on_event
    func(event)
  File "/usr/local/lib/python2.7/dist-packages/pymesos/scheduler.py", line 451, in on_update
    self.acknowledgeStatusUpdate(status)
  File "/usr/local/lib/python2.7/dist-packages/pymesos/scheduler.py", line 305, in acknowledgeStatusUpdate
    self._send(body)
  File "/usr/local/lib/python2.7/dist-packages/pymesos/scheduler.py", line 159, in _send
    resp = conn.getresponse()
  File "/usr/lib/python2.7/httplib.py", line 1136, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 453, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 417, in _read_status
    raise BadStatusLine(line)
BadStatusLine: ''
[tfrun] Process IO 2017-04-21 10:06:34,368 Thread abort:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 308, in _run
    if not conn.read():
  File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 148, in read
    self._callback.process_event(event)
  File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 226, in process_event
    self.on_event(event)
  File "/usr/local/lib/python2.7/dist-packages/pymesos/scheduler.py", line 491, in on_event
    func(event)
  File "/usr/local/lib/python2.7/dist-packages/pymesos/scheduler.py", line 451, in on_update
    self.acknowledgeStatusUpdate(status)
  File "/usr/local/lib/python2.7/dist-packages/pymesos/scheduler.py", line 305, in acknowledgeStatusUpdate
    self._send(body)
  File "/usr/local/lib/python2.7/dist-packages/pymesos/scheduler.py", line 159, in _send
    resp = conn.getresponse()
  File "/usr/lib/python2.7/httplib.py", line 1136, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 453, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 417, in _read_status
    raise BadStatusLine(line)
BadStatusLine: ''

Mesos shows the first worker's status as FINISHED; the other workers (except worker-0) and the parameter server are KILLED, while worker-0 remains RUNNING. I can, however, kill worker-0 manually.

I tried letting all workers signal the parameter server via a FIFO queue when their real work is done, and then sleep. The parameter server waits until all workers have signaled it and then exits without sleeping. I found this approach works: all workers finish their work and exit when the parameter server exits, and their statuses are FINISHED at the end.
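For illustration, one way to implement that kind of end-of-work signaling (using sentinel files on the shared filesystem rather than an actual FIFO; purely a sketch, not the reporter's code):

# Sketch only: workers drop a sentinel file when done and then sleep; the ps
# exits once all sentinels exist. Paths and counts are made up.
import os
import time

DONE_DIR = '/nfs/mnist/done'    # assumed shared directory
NUM_WORKERS = 5


def worker_done(task_index):
    open(os.path.join(DONE_DIR, 'worker-%d' % task_index), 'w').close()
    while True:                  # keep the worker task alive
        time.sleep(60)


def ps_wait_and_exit():
    while len(os.listdir(DONE_DIR)) < NUM_WORKERS:
        time.sleep(5)
    # returning lets the ps process exit, which ends the whole job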

So I wonder: are we supposed to make sure no worker exits before the parameter servers? Is my approach legitimate? Also, why are there no logs around the error above? Can you please elaborate on what happened behind the scenes?

program stuck when running plus.py

Hello, I am just getting started with tfmesos. I tried to run the simple test (plus.py) from the README, and the program seems to be stuck in a loop (the task stays in the running state). Here is the traceback after a keyboard interrupt:
Traceback (most recent call last):
  File "/tmp/plus.py", line 38, in <module>
    main(sys.argv)
  File "/tmp/plus.py", line 22, in main
    with cluster(jobs_def, master=mesos_master, quiet=False) as c:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tfmesos/__init__.py", line 19, in cluster
    s.start()
  File "/usr/local/lib/python2.7/dist-packages/tfmesos/scheduler.py", line 328, in start
    if readable(lfd):
  File "/usr/local/lib/python2.7/dist-packages/tfmesos/scheduler.py", line 310, in readable
    return bool(select.select([fd], [], [], 0.1)[0])
  File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 27, in _handle_sigint
    return _prev_handler(signum, frame)
KeyboardInterrupt
And here is the traceback from the Mesos executor:
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/dist-packages/tfmesos/server.py", line 78, in <module>
    sys.exit(main(sys.argv))
  File "/usr/local/lib/python2.7/dist-packages/tfmesos/server.py", line 25, in main
    response = recv(c)
  File "/usr/local/lib/python2.7/dist-packages/tfmesos/utils.py", line 13, in recv
    assert len(d) == struct.calcsize('>I'), repr(d)
AssertionError: ''
It seems like a socket connection problem, but I can't find out what's wrong with the connection. Any ideas? Thanks.

Getting started help needed

Hey there!

We are trying to set up a GPU cluster for running TensorFlow jobs and your project seems very promising, but we are having some issues even running the examples. Our setup is Ubuntu 16.04, running the Mesos master and 1 agent locally (127.0.0.1). We are able to run the simple example and successfully get an output of 42.

However, we can't run the ps/worker example; the CPU docker run fails at:

File "mnist_replica.py", line 46, in <module>
    import input_data

We invoke the command as:
docker run --rm -it -e MESOS_MASTER=127.0.0.1:5050 \
    -e http_proxy=<our_proxy_info> \
    -e https_proxy=<our_proxy_info> \
    --net=host \
    -v /nfs/mnist:/nfs/mnist \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u` \
    -w /nfs/mnist \
    tfmesos/tfmesos \
    tfrun -w 1 -s 1 \
    -V /nfs/mnist:/nfs/mnist \
    -- python mnist_replica.py \
    --ps_hosts {ps_hosts} --worker_hosts {worker_hosts} \
    --job_name {job_name} --worker_index {task_index}

Does the data exist inside the container? Also, can you please explain: do we need to execute both the docker run (CPU) and the nvidia-docker run (GPU) commands? Shouldn't the second also take care of the parameter servers? It's not very clear what is happening behind the scenes. When we try to execute the nvidia-docker run as:
nvidia-docker run --rm -it -e MESOS_MASTER=localhost:5050 \
    -e http_proxy=<our_proxy_info> \
    -e https_proxy=<our_proxy_info> \
    --net=host \
    -v /home/bxa005/tfmesos/examples/mnist:/nfs/mnist \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u` \
    -w /nfs/mnist \
    tfmesos/tfmesos \
    tfrun -w 1 -s 1 -Gw 1 -- python mnist_replica.py \
    --ps_hosts {ps_hosts} --worker_hosts {worker_hosts} \
    --job_name {job_name} --worker_index {task_index}

it fails at:
[tfrun] Process IO 2017-04-17 16:01:53,737 Failed to process event
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 148, in read
    self._callback.process_event(event)
  File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 226, in process_event
    self.on_event(event)
  File "/usr/local/lib/python2.7/dist-packages/pymesos/scheduler.py", line 491, in on_event
    func(event)
  File "/usr/local/lib/python2.7/dist-packages/pymesos/scheduler.py", line 434, in on_offers
    self, [self._dict_cls(offer) for offer in offers]
  File "/usr/local/lib/python2.7/dist-packages/tfmesos/scheduler.py", line 250, in resourceOffers
    containerizer_type=self.containerizer_type
  File "/usr/local/lib/python2.7/dist-packages/tfmesos/scheduler.py", line 135, in to_task_info
    for src, dst in iteritems(self.volumes):
  File "/usr/local/lib/python2.7/dist-packages/six.py", line 599, in iteritems
    return d.iteritems(**kw)
AttributeError: 'NoneType' object has no attribute 'iteritems'

Tensorflow 1.0 Support.

TensorFlow 1.0 has been out for a while, with many APIs and files changed. We should consider supporting TensorFlow 1.0.

Problem running simple Test code

When I run:
$ docker run -e MESOS_MASTER=mesos-master:5050 \
    -e DOCKER_IMAGE=tfmesos/tfmesos \
    --net=host \
    -v /path-to-your-tfmesos-code/tfmesos/examples/plus.py:/tmp/plus.py \
    --rm \
    -it \
    tfmesos/tfmesos \
    python /tmp/plus.py mesos-master:5050

I got docker: Error response from daemon: oci runtime error: container_linux.go:262: starting container process caused "exec: "/tmp/plus.py": permission denied"

Could anyone help me?

Yuxin

Readme

Are the instructions in the README still valid?

tfMesos on mesos single-node cluster

Hi,

I am trying to run tfMesos on a single-node Mesos cluster with the master and slave set to 127.0.1.1. However, when I run plus.py from tfMesos/examples it does not work. The error is in _start_tf_cluster when it tries to receive an "ok" response from the Mesos slave, but it never receives it. It runs fine on a Mesos cluster of more than one node, with the master and slaves using the external IP addresses of different machines in the cluster.

Could you please let me know what the problem is? I have tried using localhost and 127.0.0.1 in the Mesos configuration, but that does not help either. Mesos seems to be running fine. It seems there is some issue in the tfMesos code that expects to receive responses from Mesos.

Here is the stack trace:

Traceback (most recent call last):
  File "plus.py", line 44, in <module>
    main(sys.argv)
  File "plus.py", line 28, in main
    with cluster(jobs_def, master=mesos_master, quiet=False) as targets:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "build/bdist.linux-x86_64/egg/tfmesos/__init__.py", line 19, in cluster
  File "build/bdist.linux-x86_64/egg/tfmesos/scheduler.py", line 299, in start
  File "build/bdist.linux-x86_64/egg/tfmesos/scheduler.py", line 255, in _start_tf_cluster
  File "build/bdist.linux-x86_64/egg/tfmesos/utils.py", line 16, in recv
AssertionError: ''

Unified Containerization support for Apache Mesos 1.0

cf: tensorflow/tensorflow#1996 (comment)
Quote from @klueska:

Regarding problems figuring out how to enable GPU support -- I can help with that. We basically mimic the functionality of nvidia-docker so that anything that runs in nvidia-docker should now be able to run in mesos as well. Consider the following example:

$ mesos-master \
      --ip=127.0.0.1 \
      --work_dir=/var/lib/mesos
$ mesos-agent \
      --master=127.0.0.1:5050 \
      --work_dir=/var/lib/mesos \
      --image_providers=docker \
      --executor_environment_variables="{}" \
      --isolation="docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia"
$ mesos-execute \
      --master=127.0.0.1:5050 \
      --name=gpu-test \
      --docker_image=nvidia/cuda \
      --command="nvidia-smi" \
      --framework_capabilities="GPU_RESOURCES" \
      --resources="gpus:1"

The flags of note here are:

mesos-agent: 
      --isolation="docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia" 
mesos-execute: 
      --resources="gpus:1" 
      --framework_capabilities="GPU_RESOURCES" 

When launching an agent, both the cgroups/devices and the gpu/nvidia isolation flags are required for Nvidia GPU support in Mesos. Likewise, the docker/runtime and filesystem/linux flags are needed to enable running docker images with the unified containerizer.

The cgroups/devices flag tells the agent to restrict access to a specific set of devices when launching a task (i.e. a subset of the devices listed in /dev). The gpu/nvidia isolation flag allows the agent to grant / revoke access to GPUs on a per-task basis. It also handles automatic injection of the Nvidia libraries / volumes into the container if the label com.nvidia.volumes.needed = nvidia_driver is present in the docker image. The docker/runtime flag allows the agent to parse docker image files and containerize them. The filesystem/linux flag says to use linux specific functionality when creating / entering the new mount namespace for the container filesystem.

In addition to these agent isolation flags, Mesos requires frameworks that want to consume GPU resources to have the GPU_RESOURCES framework capability set. Without this, the master will not send an offer to a framework if it contains GPUs. The choice to make frameworks explicitly opt-in to this GPU_RESOURCES capability was to keep legacy frameworks from accidentally consuming a bunch of non-GPU resources on any GPU-capable machines in a cluster (and thus blocking your GPU jobs from running). It's not that big a deal if all of your nodes have GPUs, but in a mixed-node environment, it can be a big problem.

Finally, the --resources="gpus:1" flag tells the framework to only accept offers that contain at least 1 GPU. This is just an example of consuming a single GPU, you can (and probably should) build your framework to do something more interesting.

Hopefully you can extrapolate things from there. Let me know if you have any questions.

We can try to enable unified containerization for Mesos >= 1.0:
- [ ] Add the com.nvidia.volumes.needed = nvidia_driver label to the tfmesos/tfmesos image

  • Get the current Mesos master version (MasterInfo.version)
  • If the Mesos version is < 1.0, fall back to the Mesos + Docker + Nvidia-docker combination
  • Otherwise, instead of using nvidia-docker volume parameters, set the GPU_RESOURCES capability

TFMesosScheduler hangs when a sub-task fails to launch

Somehow a sub-task failed to launch due to an unexpected registry issue:

[tfrun] Process IO 2017-05-16 18:58:09,864 Tensorflow cluster registered. ( http://master.mesos:5050/#/frameworks/e46643c7-87ec-4929-b4dc-e6d694066102-0110 )
[tfrun] Process IO 2017-05-16 18:58:13,823 Task failed: 
<Task
  mesos_task_id=10
  addr=None
>, Failed to launch container: Failed to run 'docker -H unix:///var/run/docker.sock pull xxx': exited with status 1; stderr='error pulling image configuration: received unexpected HTTP status: 503 Service Unavailable
'

tfrun hangs in that case. strace shows it stuck in a loop of select calls:

select(5, [4], [], [], {0, 68058})      = 0 (Timeout)
select(5, [4], [], [], {0, 100000})     = 0 (Timeout)
select(5, [4], [], [], {0, 100000})     = 0 (Timeout)

The code seems to be at https://github.com/douban/tfmesos/blob/master/tfmesos/scheduler.py#L319. TFMesosScheduler keeps waiting forever for the failed task. Other tasks have connected, but since self._start_tf_cluster() was never reached, they wait there indefinitely too.

In this case, does it make sense to invoke self.stop() when statusUpdate() receives a task failure? Similarly, should slaveLost(), executorLost() and error() stop the driver too?
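For illustration, the suggested change might look roughly like the sketch below, written against pymesos-style callbacks; it is not actual tfmesos code, and the stop() helper is assumed:

# Sketch only: abort as soon as any task reaches a terminal failure state,
# instead of waiting on it forever.
import logging

logger = logging.getLogger(__name__)

TERMINAL_FAILURES = ('TASK_FAILED', 'TASK_LOST', 'TASK_ERROR', 'TASK_KILLED')


def statusUpdate(self, driver, update):
    if update['state'] in TERMINAL_FAILURES:
        logger.warning('Task failed: %s', update.get('message', ''))
        self.stop()      # assumed helper that tears down the TF cluster
        driver.stop()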

Command doesn't run when GPU support is enabled

I start up the slave on another node like this:

mesos-slave  --master=zk://ip:2181/mesos \
            --containerizers=docker,mesos \
            --hostname=node2 \
            --ip=ip \
            --isolation=cgroups/devices,gpu/nvidia \
            --log_dir=/var/log/mesos/ \
            --work_dir=/var/lib/mesos/ \
            --docker=/usr/bin/nvidia-docker \
            --executor_environment_variables='{"NV_DOCKER": "/usr/bin/docker"}' \
            $(curl -s localhost:3476/mesos/cli)

or like this:

mesos-slave --master=zk://ip:2181/mesos \
            --containerizers=docker,mesos \
            --hostname=node2\
            --ip=ip\
            --isolation=cgroups/devices,gpu/nvidia \
            --log_dir=/var/log/mesos \
            --work_dir=/var/lib/mesos

Then the GPU can be found:

I0906 22:11:23.642832 70287 slave.cpp:519] Agent resources: gpus():4; cpus():32; mem():63131; disk():15350; ports(*):[31000-32000]

But when I run mesos-execute:

 mesos-execute --master=$MASTER --name="cluster-test" --command="sleep 5" -v
I0906 22:14:41.094125 70483 scheduler.cpp:172] Version: 1.0.1
I0906 22:14:41.096889 70481 scheduler.cpp:461] New master detected at master@ip:5050
Subscribed with ID '035e3bd4-7642-434e-8d39-baafa35fe495-0014'

it then does nothing; the command does not start up.

I'm not sure whether this is a problem with tfmesos or with Mesos.

__getitem__ not defined?

Just trying the basic example plus.py, I get this error message:

INFO:tfmesos.scheduler:Tensorflow cluster registered. ( http://192.168.1.30:5050/#/frameworks/003fba48-218a-484d-97d2-3ab5c89c9257-0025 )
2017-05-18 15:49:50,823 [INFO] [tfmesos.scheduler] Device /job:ps/task:0 activated @ grpc://dcos-agent-3.novalocal:46126
INFO:tfmesos.scheduler:Device /job:ps/task:0 activated @ grpc://dcos-agent-3.novalocal:46126
2017-05-18 15:49:50,824 [INFO] [tfmesos.scheduler] Device /job:ps/task:1 activated @ grpc://dcos-agent-3.novalocal:39951
INFO:tfmesos.scheduler:Device /job:ps/task:1 activated @ grpc://dcos-agent-3.novalocal:39951
2017-05-18 15:49:50,824 [INFO] [tfmesos.scheduler] Device /job:worker/task:0 activated @ grpc://dcos-agent-3.novalocal:34271
INFO:tfmesos.scheduler:Device /job:worker/task:0 activated @ grpc://dcos-agent-3.novalocal:34271
2017-05-18 15:49:50,825 [INFO] [tfmesos.scheduler] Device /job:worker/task:1 activated @ grpc://dcos-agent-3.novalocal:42713
INFO:tfmesos.scheduler:Device /job:worker/task:1 activated @ grpc://dcos-agent-3.novalocal:42713
2017-05-18 15:49:50,841 [DEBUG] [tfmesos.scheduler] exit
DEBUG:tfmesos.scheduler:exit
Traceback (most recent call last):
  File "/tmp/plus.py", line 38, in <module>
    main(sys.argv)
  File "/tmp/plus.py", line 32, in main
    with tf.Session(targets['/job:worker/task:0']) as sess:
TypeError: 'TFMesosScheduler' object has no attribute '__getitem__'

TypeError: range() integer end argument expected, got float.

Hi, we got an error when running the example in the tutorial. We are using a single-node Mesos cluster with 1 GPU. When running the mnist.py example in the tfmesos/tfmesos Docker image, it says:

python ./tfmesos/examples/mnist/mnist.py --worker-gpus 1
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
2016-11-15 22:06:47,260 [INFO] [tfmesos.scheduler] Tensorflow cluster registered. ( http://192.17.237.107:5050/#/frameworks/f8104b2b-e639-452f-820a-9a12243ec708-0015 )
No handlers could be found for logger "pymesos.process"
2016-11-15 22:06:47,357 [DEBUG] [tfmesos.scheduler] exit
Traceback (most recent call last):
  File "./tfmesos/examples/mnist/mnist.py", line 31, in <module>
    with cluster(jobs_def, master=master, quiet=False) as targets:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tfmesos/__init__.py", line 20, in cluster
    yield s.start()
  File "/usr/local/lib/python2.7/dist-packages/tfmesos/scheduler.py", line 306, in start
    if readable(lfd):
  File "/usr/local/lib/python2.7/dist-packages/tfmesos/scheduler.py", line 289, in readable
    return bool(select.select([fd], [], [], 0.1)[0])
  File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 23, in _handle_sigint
    reraise(*exc_info)
  File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 287, in _run
    if not conn.read():
  File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 129, in read
    self._callback.process_event(event)
  File "/usr/local/lib/python2.7/dist-packages/pymesos/process.py", line 208, in process_event
    self.on_event(event)
  File "/usr/local/lib/python2.7/dist-packages/pymesos/scheduler.py", line 442, in on_event
    func(event)
  File "/usr/local/lib/python2.7/dist-packages/pymesos/scheduler.py", line 392, in on_offers
    self, [self._dict_cls(offer) for offer in offers]
  File "/usr/local/lib/python2.7/dist-packages/tfmesos/scheduler.py", line 220, in resourceOffers
    offered_gpus = list(range(resource.scalar.value))
TypeError: range() integer end argument expected, got float.

It seems resource.scalar.value is a float and is causing trouble for range(). Thanks!
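For what it's worth, a minimal fix would be to truncate the scalar before building the range, e.g.:

# resource.scalar.value arrives as a float from the Mesos offer;
# range() in this code path needs an int.
offered_gpus = list(range(int(resource.scalar.value)))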

Why use socket to connect to servers and run commands instead of SSH?

According to my understanding, in tfmesos, tfrun connects to server.py on the remote nodes through a socket, and then server.py runs the TensorFlow script received from the scheduler via subprocess.

Why not use a subprocess to SSH into the servers and then run the TF scripts? I know that Spark uses SSH to log in to workers, and SSH is more convenient and safer than raw sockets.

Thanks!

scheduler select says not readable

I'm trying to get your simple demo.py working. I have Mesos 1.1.0 running, but when I run the Docker container, it hangs on the "cluster(" call. After digging through the code, I found there is a loop inside Scheduler.start where "def readable" always returns False on the select. As a result it gets stuck in that infinite loop in TFMesosScheduler.start.

Any ideas why that would be? The master is registering the call.
1201 15:15:24.657160 2424 http.cpp:391] HTTP POST for /master/api/v1/scheduler from 100.114.84.196:53344
I1201 15:15:24.661165 2424 master.cpp:2313] Received subscription request for HTTP framework '[tensorflow] /tmp/demo.py host1:5050'
I1201 15:15:24.661810 2424 master.cpp:2411] Subscribing framework '[tensorflow] /tmp/demo.py 100.114.84.222:5050' with checkpointing disabled and capabilities [ GPU_RESOURCES ]
I1201 15:15:24.662701 2423 hierarchical.cpp:275] Added framework 9d1e7dc2-dbcf-4a7f-bd67-2ba4dd82f695-0002

I'm running in non-GPU mode.

Executor terminated

Hi. I was using your project successfully two weeks ago. However, there has been a problem since upgrading to this version. Below is the problem I face.

The versions I am using:

  • mesos 1.1.0
  • cuda - 8.0
  • cudnn- 5.0

My Mesos launch commands:

  • master
    mesos-master
    --ip=192.168.10.9
    --work_dir=/var/lib/mesos
    --log_dir=/var/lib/mesos/masterlog

  • agent
    mesos-agent
    --master=deepmaster:5050
    --work_dir=/var/lib/mesos
    --log_dir=/var/lib/mesos/slavelog
    --containerizers=docker,mesos
    --image_providers=docker
    --isolation="filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia"
    --executor_environment_variables='{"LD_LIBRARY_PATH": "/usr/local/lib/"}'

Result of running the simple test:
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:119] Couldn't open CUDA library libcuda.so.1. LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: deepmaster
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:363] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.57 Mon Oct 3 20:37:01 PDT 2016
GCC version: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 367.57.0
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1092] LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1093] failed to find libcuda.so on this system: Failed precondition: could not dlopen DSO: libcuda.so.1; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
2016-12-22 08:07:27,775 [INFO] [tfmesos.scheduler] Tensorflow cluster registered. ( http://deepmaster:5050/#/frameworks/7b15d5da-b786-4cae-9d28-625972f40cfc-0000 )
INFO:tfmesos.scheduler:Tensorflow cluster registered. ( http://deepmaster:5050/#/frameworks/7b15d5da-b786-4cae-9d28-625972f40cfc-0000 )
2016-12-22 08:07:32,846 [WARNING] [tfmesos.scheduler] Task failed:
<Task
mesos_task_id=2
addr=None

, Executor terminated
WARNING:tfmesos.scheduler:Task failed:
<Task
mesos_task_id=2
addr=None
, Executor terminated
2016-12-22 08:07:33,246 [WARNING] [tfmesos.scheduler] Task failed:
<Task
mesos_task_id=0
addr=None
, Executor terminated
WARNING:tfmesos.scheduler:Task failed:
<Task
mesos_task_id=0
addr=None
, Executor terminated
2016-12-22 08:07:33,548 [WARNING] [tfmesos.scheduler] Task failed:
<Task
mesos_task_id=3
addr=None
, Executor terminated
WARNING:tfmesos.scheduler:Task failed:
<Task
mesos_task_id=3
addr=None
, Executor terminated
2016-12-22 08:07:33,649 [WARNING] [tfmesos.scheduler] Task failed:
<Task
mesos_task_id=1
addr=None
, Executor terminated
WARNING:tfmesos.scheduler:Task failed:
<Task
mesos_task_id=1
addr=None
, Executor terminated

Below is the mesos-agent error log:
mesos-executor: error while loading shared libraries: libmesos-1.1.0.so: cannot open shared object file: No such file or directory

I've had this problem in the past, and I've added --executor_environment_variables='{"LD_LIBRARY_PATH": "/usr/local/lib/"}' to the command.

Unable to use python3 to run mnist.py

The following errors occur when running "python3 mnist.py mesos_master:5050":
Extracting MNIST_data/train-images-idx3-ubyte.gz
Traceback (most recent call last):
  File "mnist.py", line 30, in <module>
    mnist = read_data_sets("MNIST_data/", one_hot=True)
  File "/tfmesos/examples/mnist/input_data.py", line 195, in read_data_sets
    train_images = extract_images(local_file)
  File "/tfmesos/examples/mnist/input_data.py", line 62, in extract_images
    buf = bytestream.read(rows * cols * num_images)
  File "/usr/lib/python3.5/gzip.py", line 274, in read
    return self._buffer.read(size)
  File "/usr/lib/python3.5/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/lib/python3.5/gzip.py", line 467, in read
    buf = self._fp.read(io.DEFAULT_BUFFER_SIZE)
  File "/usr/lib/python3.5/gzip.py", line 82, in read
    return self.file.read(size)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/file_io.py", line 112, in read
    return pywrap_tensorflow.ReadFromStream(self._read_buf, length, status)
  File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.OutOfRangeError: reached end of file

I think it is a bug in input_data.py to use tf.gfile; I will submit a PR to fix this shortly.
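For reference, a possible direction for the fix is to hand gzip a plain binary file object from the built-in open() instead of a tf.gfile handle. The snippet below is only a sketch of that idea, not the actual PR:

# Sketch of the idea: feed gzip a plain binary file object, not tf.gfile.
import gzip
import numpy


def extract_images(filename):
    with open(filename, 'rb') as f:           # plain file object
        with gzip.GzipFile(fileobj=f) as bytestream:
            data = bytestream.read()
    # Real code parses the IDX header; this sketch just returns raw bytes.
    return numpy.frombuffer(data, dtype=numpy.uint8)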
