Comments (3)
Hi @akrupien,
I am providing an example of how you can structure your code to use MultiWorkerMirroredStrategy along with saving checkpoints and using callbacks. This example assumes you have a working model training pipeline and focuses on the TensorFlow configuration, strategy setup, and checkpoint saving. Please find the gist for reference.
Thank you!
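For reference, a minimal sketch of that structure might look like the following. This is an assumption about the gist's shape, not its actual contents: the hostnames, port, and `build_model` function are placeholders. Each worker exports a `TF_CONFIG` environment variable describing the cluster before the strategy is created, and the model is built and compiled inside `strategy.scope()`.

```python
import json
import os

# Each worker must set TF_CONFIG before creating the strategy.
# Hypothetical two-machine cluster; worker 0 acts as the chief.
tf_config = {
    "cluster": {"worker": ["machine-a:12345", "machine-b:12345"]},
    "task": {"type": "worker", "index": 0},  # use index 1 on the second machine
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)

# The distributed part (requires TensorFlow installed on every worker):
#   import tensorflow as tf
#   strategy = tf.distribute.MultiWorkerMirroredStrategy()
#   with strategy.scope():
#       model = build_model()  # placeholder for your model-building function
#       model.compile(optimizer="adam", loss=..., metrics=[...])
#   checkpoint = tf.keras.callbacks.ModelCheckpoint(
#       "model.h5", monitor="val_loss", save_best_only=True)
#   model.fit(train_ds, validation_data=val_ds, epochs=400,
#             callbacks=[checkpoint])
```

The same script runs on every machine; only the `task.index` in `TF_CONFIG` differs per worker.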
Hi @Venkat6871,
Thank you very much for your response and your example! I have changed my code to match your structure, so I only build and compile my model within strategy.scope(). Synchronous training between my machines is still working, which is great. However, I am still having the issue with my Dice coefficient metrics. I'll attach a snippet of the training output here so you have an example.
Epoch 1/400
28/28 [==============================] - ETA: 0s - loss: 3.3312 - dice_coef: 0.0686
Epoch 1: val_loss improved from inf to 1.63522, saving model to /home/path/model1.h5
28/28 [==============================] - 184s 4s/step - loss: 1.6656 - dice_coef: 0.0343 - val_loss: 1.6352 - val_dice_coef: 0.0303
Epoch 2/400
28/28 [==============================] - ETA: 0s - loss: 3.1451 - dice_coef: 0.1314
Epoch 2: val_loss improved from 1.63522 to 1.57489, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 833ms/step - loss: 1.5726 - dice_coef: 0.0657 - val_loss: 1.5749 - val_dice_coef: 0.0431
Epoch 3/400
28/28 [==============================] - ETA: 0s - loss: 3.0354 - dice_coef: 0.1781
Epoch 3: val_loss improved from 1.57489 to 1.53716, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 828ms/step - loss: 1.5177 - dice_coef: 0.0890 - val_loss: 1.5372 - val_dice_coef: 0.0577
Epoch 4/400
28/28 [==============================] - ETA: 0s - loss: 2.9451 - dice_coef: 0.2227
Epoch 4: val_loss improved from 1.53716 to 1.51450, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 831ms/step - loss: 1.4726 - dice_coef: 0.1114 - val_loss: 1.5145 - val_dice_coef: 0.0577
Epoch 5/400
28/28 [==============================] - ETA: 0s - loss: 2.8993 - dice_coef: 0.2236
Epoch 5: val_loss improved from 1.51450 to 1.49228, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 829ms/step - loss: 1.4496 - dice_coef: 0.1118 - val_loss: 1.4923 - val_dice_coef: 0.0577
Epoch 6/400
28/28 [==============================] - ETA: 0s - loss: 2.8554 - dice_coef: 0.2237
Epoch 6: val_loss improved from 1.49228 to 1.47076, saving model to /home/path/model1.h5
28/28 [==============================] - 23s 834ms/step - loss: 1.4277 - dice_coef: 0.1119 - val_loss: 1.4708 - val_dice_coef: 0.0577
You'll notice my dice coefficients are being halved; I believe this is because I am using two machines.
It appears to be summing the dice values and losses from each machine and showing those sums throughout the steps of the epoch, then averaging them across the two machines when the checkpoint is saved. (I believe it is summing because of the loss values: my typical loss on a single machine after the first epoch is ~1.6, so a loss of 3.3 only seems achievable by summing the losses from both machines.) If I turn off checkpoint saving, it seems to average the dice values and losses on the last step of the epoch instead. I would appreciate some clarification on whether this is what is happening.
TensorFlow's documentation describes NCCL and ring all-reduce as summing variables between machines, but it does not say whether the values are averaged back out afterward. I'd expect them to be, but this doesn't seem to be stated explicitly anywhere. I am also confused as to why the summed values are shown during training rather than the averaged ones; it looks as though the model is training throughout the epoch on an accumulated dice and loss rather than on the average across machines. Shouldn't it train on the average between workers rather than the sum? Otherwise the loss is artificially high.
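A quick arithmetic check of the epoch-1 numbers in the log is consistent with that reading. This is only a pure-Python sanity check on the logged values, not a statement about what TensorFlow actually does internally:

```python
# Values taken from the epoch-1 log lines above.
displayed_in_epoch = 3.3312   # loss shown while the epoch is running
end_of_epoch = 1.6656         # loss shown on the final line of the epoch
n_workers = 2

# If the in-epoch value is a sum over workers, dividing by the worker
# count should reproduce the end-of-epoch value exactly.
assert abs(displayed_in_epoch / n_workers - end_of_epoch) < 1e-9

# The same halving shows up in the metric: 0.0686 in-epoch vs 0.0343.
assert abs(0.0686 / n_workers - 0.0343) < 1e-6
print("in-epoch values are consistent with a sum over 2 workers")
```

Every epoch in the snippet shows the same 2:1 ratio between the in-epoch and end-of-epoch values, for both the loss and the dice coefficient.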
Thank you again,
@akrupien
Any Ideas?