tmulc18 / distributed-tensorflow-guide
Distributed TensorFlow basics and examples of training algorithms
License: MIT License
Hello, why can't the optimizer's learning rate be fed through a placeholder in the distributed code?
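For context, this is the kind of pattern the question is about; a minimal, non-distributed sketch (not code from this repo) of feeding the learning rate through a placeholder in TF1:

```python
import tensorflow as tf

# Minimal sketch of a placeholder-fed learning rate (TF1-style).
x = tf.Variable(5.0)
loss = tf.square(x)

lr = tf.placeholder(tf.float32, shape=[])                    # fed at run time
train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(10):
        # The Python side decides the decay schedule and feeds it each step.
        sess.run(train_op, feed_dict={lr: 0.1 / (1.0 + step)})
```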
I have a single Exxact machine with 4 GPUs and was looking into how to run my code in a multi-GPU environment. I get the following traceback for the file mentioned above after running the .sh file.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Traceback (most recent call last):
File "dist_mult_gpu_sing_mach.py", line 76, in <module>
main()
File "dist_mult_gpu_sing_mach.py", line 29, in main
task_index=FLAGS.task_index,config=config)
UnboundLocalError: local variable 'config' referenced before assignment
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: ubuntu
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: ubuntu
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 384.98.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:363] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.98 Thu Oct 26 15:16:01 PDT 2017
GCC version: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 384.98.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 384.98.0
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2223, 1 -> localhost:2224}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:211] Started server with target: grpc://localhost:2222
Traceback (most recent call last):
File "dist_mult_gpu_sing_mach.py", line 76, in <module>
main()
File "dist_mult_gpu_sing_mach.py", line 29, in main
task_index=FLAGS.task_index,config=config)
UnboundLocalError: local variable 'config' referenced before assignment
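The UnboundLocalError itself is a plain Python problem: config is bound only inside one branch, so for the task whose branch is not taken the name never exists when it is passed to the server. A minimal sketch of a fix, with flag and cluster values that are only illustrative guesses at what dist_mult_gpu_sing_mach.py does:

```python
import tensorflow as tf

# Illustrative only; flag and cluster definitions are guesses, not the repo's actual code.
FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string("job_name", "ps", "'ps' or 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of the task")

cluster = tf.train.ClusterSpec({"ps": ["localhost:2222"],
                                "worker": ["localhost:2223", "localhost:2224"]})

def main(_):
    # Bind 'config' before any branching so it always exists; assigning it only
    # inside an "if FLAGS.job_name == 'worker':" branch leaves it undefined for
    # the ps task and raises exactly the UnboundLocalError shown above.
    config = tf.ConfigProto()
    if FLAGS.job_name == "worker":
        # e.g. pin each worker to its own GPU
        config.gpu_options.visible_device_list = str(FLAGS.task_index)

    server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index, config=config)
    server.join()

if __name__ == "__main__":
    tf.app.run()
```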
While doing distributed training across multiple computers, do I need to run run.sh on the PS machine or on each computer separately? Neither approach is working for me. When I run my code on a single computer with multiple workers it works properly, but I don't know how to run it on multiple computers. Is a cluster manager (like Kubernetes) required for this?
Maybe we should add a tutorial showing how to set up a cluster in TF and launch a session with the cluster.
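A minimal sketch of what such a tutorial could show, with placeholder hostnames: the same script is launched on every machine, and only the --job_name / --task_index flags differ.

```python
import tensorflow as tf

FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string("job_name", "worker", "'ps' or 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index within the job")

# One entry per machine/port; replace the hostnames with your own.
cluster = tf.train.ClusterSpec({
    "ps":     ["machine-a:2222"],
    "worker": ["machine-b:2222", "machine-c:2222"],
})

def main(_):
    server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)
    if FLAGS.job_name == "ps":
        server.join()          # the parameter server just serves variables
    else:
        # Variables are placed on the ps, ops stay on this worker.
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=cluster)):
            counter = tf.Variable(0.0)
            train_op = tf.assign_add(counter, 1.0)
        with tf.Session(server.target) as sess:
            sess.run(tf.global_variables_initializer())
            print(sess.run(train_op))

if __name__ == "__main__":
    tf.app.run()
```

Each machine then runs its own copy, e.g. python trainer.py --job_name=ps --task_index=0 on machine-a and --job_name=worker --task_index=0 or 1 on the others (trainer.py is a hypothetical file name). A cluster manager such as Kubernetes is not strictly required; it only automates launching and addressing the tasks.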
Update: the example with local and global variables needs some updates. It is missing the functionality for pulling the global state and assigning it to the local copies. Additionally, we should include the graph plot from TensorBoard to show what the local vs. global ops and variables look like.
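A rough sketch of the missing piece, assuming a ps/worker cluster is already running so the device strings resolve; the variable names are illustrative, not the ones used in the example:

```python
import tensorflow as tf

# Global copy lives on the parameter server, local copy on this worker.
with tf.device("/job:ps/task:0"):
    a_global = tf.Variable(tf.zeros([2]), name="a_global")

with tf.device("/job:worker/task:0"):
    a_local = tf.Variable(tf.zeros([2]), name="a_local",
                          collections=[tf.GraphKeys.LOCAL_VARIABLES])

# Op that pulls the current global state into the worker's local copy.
pull_global = tf.assign(a_local, a_global)

# Op that pushes an update (e.g. accumulated gradients) back to the global copy.
delta = tf.placeholder(tf.float32, shape=[2])
push_update = tf.assign_add(a_global, delta)
```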
There is an issue with using Adagrad instead of SGD at the worker level: the second worker never starts.
The example is brilliant and inspires me a lot.
However, it doesn't show how to feed data to the graph.
In practice, I find it impossible to feed a different mini-batch for the local optimizer to compute gradients on at each window step. Here's what I mean:
It's easy to feed data to the graph with
session.run([opt, ...], feed_dict={...})
I think that every time session.run(..., feed_dict={...}) is called, the graph uses the same feed_dict to compute gradients with the local optimizer window times. The gradients do turn out different each time, because the previous gradients have already been applied to the local variables, but it is still the same batch that is used window times and then summed into one set of gradients to apply to the global variables.
I notice that this is different from the pseudocode described in the paper "Large Scale Distributed Deep Networks", and I quote:
Algorithm 7.1: DOWNPOURSGDCLIENT(α, n_fetch, n_push)

procedure STARTASYNCHRONOUSLYFETCHINGPARAMETERS(parameters)
    parameters ← GETPARAMETERSFROMPARAMSERVER()

procedure STARTASYNCHRONOUSLYPUSHINGGRADIENTS(accruedgradients)
    SENDGRADIENTSTOPARAMSERVER(accruedgradients)
    accruedgradients ← 0

main
    global parameters, accruedgradients
    step ← 0
    accruedgradients ← 0
    while true do
        if (step mod n_fetch) == 0
            then STARTASYNCHRONOUSLYFETCHINGPARAMETERS(parameters)
        **data ← GETNEXTMINIBATCH()**
        gradient ← COMPUTEGRADIENT(parameters, data)
        accruedgradients ← accruedgradients + gradient
        parameters ← parameters − α ∗ gradient
        if (step mod n_push) == 0
            then STARTASYNCHRONOUSLYPUSHINGGRADIENTS(accruedgradients)
        step ← step + 1
In the pseudocode, a new mini-batch is fetched before computing gradients at every step, regardless of whether this is the step where the accumulated gradients are pushed to the global variables; that is what I mean by the bolded "different mini-batch" line.
Is there a way to achieve this?
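One way to get the behaviour of the pseudocode is to restructure the Python training loop rather than the graph: run the local update op once per step with a fresh feed_dict, and only push to the global variables every window steps. A sketch, where local_opt, apply_to_global, pull_from_global, x, y and next_batch() are hypothetical stand-ins for the ops and input pipeline defined elsewhere in the example:

```python
# Hypothetical names: local_opt applies gradients to the local variables,
# apply_to_global pushes the accumulated gradients to the global (ps) variables,
# pull_from_global refreshes the local copy, next_batch() yields mini-batches.
window = 5
step = 0
while not done:
    xs, ys = next_batch()                        # a *different* mini-batch every step
    sess.run(local_opt, feed_dict={x: xs, y: ys})
    step += 1
    if step % window == 0:
        sess.run(apply_to_global)                # push accumulated gradients
        sess.run(pull_from_global)               # refresh local parameters
```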
Need to create local copies of the variables based on the ones that exist on the ps.
In the Synchronous-SGD example, I ran ssgd.py and got the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: NodeDef missing attr 'reduction_type'
from Op<name=ConditionalAccumulator; signature= -> handle:Ref(string); attr=dtype:type,allowed=[DT_FLOAT, DT_DOUBLE, DT_INT32, DT_UINT8, DT_INT16, ..., DT_UINT16, DT_COMPLEX128, DT_HALF, DT_UINT32, DT_UINT64]; attr=shape:shape; attr=container:string,default=""; attr=shared_name:string,default=""; attr=reduction_type:string,default="MEAN",allowed=["MEAN", "SUM"]; is_stateful=true>; NodeDef: {{node sync_replicas/conditional_accumulator}} = ConditionalAccumulator[_class=["loc:@sync_replicas/SetGlobalStep"], container="", dtype=DT_FLOAT, shape=[3,3,3,16], shared_name="conv0/conv:0/grad_accum", _device="/job:ps/replica:0/task:0/device:CPU:0"]
First of all, thanks a lot for the awesome code collection. I had one query (perhaps a naive one) regarding the following snippet:
```python
with tf.device('/cpu:0'):
    a = tf.Variable(tf.truncated_normal(shape=[2]), dtype=tf.float32)
    b = tf.Variable(tf.truncated_normal(shape=[2]), dtype=tf.float32)
```
in the Distributed Setup example. Since this code snippet is executed on the worker, I assume that the variables a and b remain on the worker. Here are my queries:
I am sure these are naive queries, but I am still grappling with distributed TensorFlow, and the documentation does not help my case. Thanks in advance.
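One quick way to answer placement questions like this empirically (a generic sketch, not the repo's code) is to print each variable's .device attribute and turn on device-placement logging; with a bare with tf.device('/cpu:0'): block and no replica_device_setter, the variables are pinned to the CPU of whichever task built the graph:

```python
import tensorflow as tf

with tf.device('/cpu:0'):
    a = tf.Variable(tf.truncated_normal(shape=[2]), dtype=tf.float32)
    b = tf.Variable(tf.truncated_normal(shape=[2]), dtype=tf.float32)

# Static device assignment recorded in the graph.
print(a.device, b.device)

# Let the runtime log every placement decision as ops actually execute.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
```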
As described above, "a_global" is a global variable, so I would expect it to be updated globally. Why is it different in "work1" and "work2"?
The local init ops should be understood and documented with comments.
Hi,
This is a very nice write-up! I'm the author of ADAG; I recently changed the name of the optimizer to AGN (Accumulated Gradient Normalization), which is described in full here: https://arxiv.org/abs/1710.02368
Just to keep everything consistent :)
Joeri