Comments (26)
Clap Clap!!
It works!
from wandb-offline-sync-hook.
Just double checking: you do run the wandb-osh script on your head node as well, right?
The hook (whose output you see above) creates a file in ~/.wandb_osh_command_dir that tells wandb-osh what to sync after every epoch. Every time wandb-osh then syncs, it removes the file again. If, however, the sync hasn't happened yet (so the file still exists) and the next epoch already completes, then you see this warning.
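The file-based handshake can be sketched roughly like this. This is a minimal illustration of the idea, not the actual wandb-osh internals; the function names and the fixed file name are made up.

```python
import tempfile
from pathlib import Path


def trigger_sync(comm_dir: Path, run_dir: str) -> Path:
    """Training side (the hook): drop a command file naming the run to sync."""
    comm_dir.mkdir(parents=True, exist_ok=True)
    command_file = comm_dir / "example.command"  # real files get unique names
    command_file.write_text(run_dir)
    return command_file


def handle_commands(comm_dir: Path) -> list:
    """Syncer side (wandb-osh): sync each requested run, then delete the file."""
    synced = []
    for command_file in sorted(comm_dir.glob("*.command")):
        run_dir = command_file.read_text().strip()
        # here the real syncer would invoke `wandb sync <run_dir>`
        synced.append(run_dir)
        command_file.unlink()  # removal signals the hook that syncing keeps up
    return synced


comm_dir = Path(tempfile.mkdtemp())
trigger_sync(comm_dir, "/scratch/wandb/offline-run-example")
print(handle_commands(comm_dir))  # ['/scratch/wandb/offline-run-example']
# If the next epoch finishes while the command file still exists, the hook
# emits the "Syncing not active or too slow" warning instead.
```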
Thank you for your reply.
I am a bit confused about how to use the wandb-osh command. Should I add the command to the shell script where I launch the Python script?
Thank you
Could you describe your setup? Since you're using this package, I assume you are running your ML on a batch system where the compute nodes don't have internet access.
In that case, submit your jobs with the hook included in your code as shown in the readme, and then, on the same server where you submitted your jobs, start wandb-osh in parallel.
Yes, I have a head node from which I launch jobs on the compute nodes with sbatch. And yes, the head node has internet and the others don't.
When you say "on the same server where you submitted your jobs, start wandb-osh in parallel", do you mean the head node?
Tell me if this is right:
(HEAD)$ sbatch myscript.sh
(HEAD)$ tmux new -s wosh
(HEAD)$ wandb-osh
myscript.sh looks like this:
#!/bin/bash
#SBATCH --job-name=pytorch_mnist # job name
#SBATCH --ntasks=1 # number of MPI tasks
#SBATCH --ntasks-per-node=1 # number of MPI tasks per node
#SBATCH --gres=gpu:1 # number of GPUs per node
#SBATCH --cpus-per-task=10 # number of cores per tasks
#SBATCH --hint=nomultithread # we get physical cores not logical
#SBATCH --distribution=block:block # we pin the tasks on contiguous cores
#SBATCH --time=3:00:00 # maximum execution time (HH:MM:SS)
#SBATCH --output=pytorch_mnist%j.out # output file name
#SBATCH --error=pytorch_mnist%j.err # error file name
set -x
cd ${SLURM_SUBMIT_DIR}
export WANDB_MODE="offline"
module purge
module load pytorch-gpu/py3/1.11.0
python ./mnist_example.py
If I try what I just proposed, I get in the stderr of my script:
DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command
WARNING: Syncing not active or too slow: Command /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command file still exists
DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command
(the last line repeats once per epoch)
and in the tmux session where wandb-osh is running I have:
INFO: Starting to watch /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...
wandb: No runs to be synced.
(the last two lines repeat for every sync attempt)
Yes, that's the correct procedure. The first "Syncing not active or too slow" warning is to be expected (and doesn't matter at all), because you start wandb-osh after two epochs have already been completed.
The real question is why wandb sync is showing "No runs to be synced".
I actually think this is a bug in wandb-osh: it seems to set up wandb/offline-run-20230118_170057-2dyqzdo6/files for syncing, rather than just wandb/offline-run-20230118_170057-2dyqzdo6/. I've always tested with ray tune, which is why I might not have been aware of this.
I will fix this in the next two hours and then let you know. I'd be super happy if you could test again then.
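In other words, the fix amounts to handing wandb sync the run directory itself rather than its files/ subdirectory. A hypothetical helper illustrating the correction (not the actual wandb-osh code):

```python
from pathlib import Path


def run_dir_for_sync(path: str) -> str:
    """Strip a trailing `files` component so `wandb sync` gets the run dir.

    wandb stores run data under .../wandb/offline-run-<stamp>-<id>/files,
    but `wandb sync` expects the offline-run-... directory itself.
    """
    p = Path(path)
    if p.name == "files":
        p = p.parent
    return str(p)


print(run_dir_for_sync(
    "/scratch/wandb/offline-run-20230118_170057-2dyqzdo6/files"
))  # /scratch/wandb/offline-run-20230118_170057-2dyqzdo6
```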
Thanks :) I'll do that.
I managed to make it work with
wandb-osh -- --include-offline /gpfswork/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-*
I don't know if it helps.
Yes, that fixes the bug with the wrong run directories that were assumed by wandb_osh. I've now fixed that in v1.0.3.
Could you test my fix by updating the package (pip3 install --upgrade wandb_osh) and then simply running wandb-osh again (no other arguments required)?
@all-contributors please add @barthelemymp for bug
I've put up a pull request to add @barthelemymp! 🎉
Nope, I still get:
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_193002-fnmizw5d/files...
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_193002-fnmizw5d/files...
wandb: No runs to be synced.
When installing, I had to add the path by hand. Do you have a command to check that the wandb-osh I call is the updated one?
Can you check python3 -m pip freeze | grep wandb-osh for the version? Because this still looks like it's using the old version...
Alternatively, you can do
import wandb_osh
print(wandb_osh.__version__)
wandb-osh==1.0.3: I have the right version, and the problem remains.
Tell me if I can do some more tests on my side.
Best, Barthelemy
Just double checking: it's also updated in the Python environment you use in the batch scripts, right? (Just in case you use some conda env there, etc.) The fix was related to the hook that is included in the Python package, not to the wandb-osh executable. Because I cannot believe that it still points to paths that end in /files with the new version...
You could also do
python -m pip install --upgrade --force-reinstall 'wandb-osh@git+https://github.com/klieret/wandb-offline-sync-hook.git@main'
and then try again; the newest version now prints out the version number at the beginning.
If running your toy analysis is too much work, you can also try this simple snippet here:
#!/usr/bin/env python3
import wandb
import os
from wandb_osh.hooks import TriggerWandbSyncHook
sync_hook = TriggerWandbSyncHook()
os.environ['WANDB_SILENT'] = 'true'
os.environ["WANDB_MODE"] = "offline"
wandb.init()
wandb.log({"loss": 123})
sync_hook()
Run it and it should print something like
INFO: This is wandb-osh v1.0.3 using communication directory /Users/fuchur/.wandb_osh_command_dir
DEBUG: Wrote command file /Users/xxx/.wandb_osh_command_dir/1cf846.command
and if you do cat /Users/xxx/.wandb_osh_command_dir/1cf846.command
(use the path from the debug message you just saw), it should show something like
/Users/xxx/Documents/23/git_sync/wandb-osh-tests/wandb/offline-run-20230118_155559-1rgh98sl
(note how it doesn't end in /files
)
So it is printing the right version:
INFO: wandb-osh v1.0.3, starting to watch /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_234022-59ewemor...
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_234022-59ewemor...
wandb: No runs to be synced.
Thank you for your commitment :)
Yes, now it points to the correct paths; that should work.
Are you running any training in parallel? Because if you already synced manually, maybe there really is nothing left to be synced.
Also, can you check in your script's output what wandb tells you to do for syncing? I usually see something like:
wandb: Waiting for W&B process to finish... (success).
wandb:
wandb: Run history:
wandb: iterations_since_restore ▁▃▅▆█
wandb: mean_accuracy ▁▄█▆▇
(...)
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/kl5675/ray_results/ray-tune-slurm-test/Trainable_f15a6ba8_8_conf_out_channels=9,lr=0.0013,momentum=0.1681_2023-01-18_18-36-34/wandb/offline-run-20230118_183635-f15a6ba8
wandb: Find logs at: ./wandb/offline-run-20230118_183635-f15a6ba8/logs
== Status ==
and the path after "You can sync this run" should be the same one we see in the output from wandb-osh.
Here it is:
DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/422f4a.command
DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/422f4a.command
DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/422f4a.command
wandb: Waiting for W&B process to finish... (success).
wandb:
wandb: Run history:
wandb: avg_a ▁▆▆▇▇▇▇▇█▇████
wandb: avg_e █▃▃▂▁▁▁▁▁▁▁▁▁▁
wandb:
wandb: Run summary:
wandb: avg_a 9909
wandb: avg_e 0.02842
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230119_013623-fuzjfsll
wandb: Find logs at: ./wandb/offline-run-20230119_013623-fuzjfsll/logs
The path shown above has exactly the same structure as the paths shown in the output of wandb-osh itself... I really don't see how this shouldn't work...
If you had wandb-osh running in parallel, it probably also showed exactly the path /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230119_013623-fuzjfsll, right? Because that means wandb-osh runs exactly the command that wandb suggests in the log...
Just another guess: could it be that you have started another instance of wandb-osh in the background? Or something else that already syncs?
In either case, do you see the runs being synced to the wandb web interface? (Though that still wouldn't explain the "No runs to be synced" message.)
OK, I found one more thing: on my laptop, wandb sync requires a path even when run from the right directory (otherwise it shows exactly the "No runs to be synced" message), whereas on my cluster it doesn't. It's strange, because it's the same version of wandb.
But let me change that in the package real quick.
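The change described above can be sketched as always building the sync invocation with an explicit run directory instead of relying on the working directory. This is an illustration of the idea, not the actual wandb-osh source:

```python
def build_sync_command(run_dir: str) -> list:
    """Build a `wandb sync` call with an explicit path.

    Running plain `wandb sync` from inside the run directory printed
    "No runs to be synced" on some machines, so the target directory
    is passed explicitly instead.
    """
    return ["wandb", "sync", run_dir]


cmd = build_sync_command("/scratch/wandb/offline-run-example")
print(" ".join(cmd))  # wandb sync /scratch/wandb/offline-run-example
# On a machine with the wandb CLI available, one would then run:
# subprocess.run(cmd, check=True)
```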
OK. Could you try
python3 -m pip install --upgrade wandb-osh
and try one last time? The version should then be 1.0.4.
I'm very sorry to use you as a beta tester here ;) But I'm absolutely confident that it will work now :)
Awesome! Thank you so much again :)