Comments (26)

barthelemymp commented on June 16, 2024

Clap Clap!!

It works!

from wandb-offline-sync-hook.

klieret commented on June 16, 2024

Just double checking: You do run the wandb-osh script on your head node as well, right?

The hook (whose output you see above) creates a file in ~/.wandb_osh_command_dir after every epoch that tells wandb-osh what to sync. Every time wandb-osh syncs, it removes the file again. If, however, the sync hasn't happened yet (so the file still exists) and the next epoch has already completed, you see this warning.
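To make the handshake described above concrete, here is a minimal sketch of it in plain Python. It is not the actual wandb-osh implementation: the function names, the fixed file name and the temporary directory are hypothetical, and the real watcher would call wandb sync where this sketch only records the path.

```python
import tempfile
from pathlib import Path


def write_command_file(command_dir: Path, run_dir: Path, name: str = "run") -> bool:
    """Hook side (hypothetical): request a sync of run_dir.

    Returns False if the previous request is still pending, which is the
    situation behind the "Syncing not active or too slow" warning.
    """
    command_dir.mkdir(parents=True, exist_ok=True)
    command_file = command_dir / f"{name}.command"
    already_pending = command_file.exists()
    command_file.write_text(str(run_dir))
    return not already_pending


def process_command_files(command_dir: Path) -> list[str]:
    """Watcher side (hypothetical): read each pending request, then remove its file."""
    handled = []
    for command_file in sorted(command_dir.glob("*.command")):
        handled.append(command_file.read_text().strip())
        # The real tool would sync the run here; this sketch only records the path.
        command_file.unlink()
    return handled


with tempfile.TemporaryDirectory() as tmp:
    command_dir = Path(tmp) / "command_dir"
    run_dir = Path(tmp) / "wandb" / "offline-run-example"

    first = write_command_file(command_dir, run_dir)   # epoch 1: nothing pending
    second = write_command_file(command_dir, run_dir)  # epoch 2: watcher has not run yet
    synced = process_command_files(command_dir)        # watcher picks up the request
    third = write_command_file(command_dir, run_dir)   # epoch 3: file was removed again
```

In this sketch the second call returns False because the command file from the first epoch still exists, which mirrors the case where the watcher is not running or is slower than the epochs.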

barthelemymp commented on June 16, 2024

Thank you for your reply.
I am a bit confused about how to use the wandb-osh command.
Should I add the command to the shell script where I launch the Python script?
Thank you.

klieret commented on June 16, 2024

Could you describe your setup? Since you're using this package, I assume you are running your ML on a batch system where the compute nodes don't have internet.
In this case, submit your jobs, including the hook in the code as shown on the readme, and then, on the same server where you submitted your jobs, start wandb-osh in parallel.

barthelemymp commented on June 16, 2024

Yes, I have a head node from which I launch jobs on the compute nodes with sbatch. And yes, the head node has internet access and the others don't.
"on the same server where you submitted your jobs, start wandb-osh in parallel." You mean the head node?

Tell me if this is right:
(HEAD)$ sbatch myscript.sh
(HEAD)$ tmux new -s wosh
(HEAD)$ wandb-osh

myscript.sh looks like this:

#!/bin/bash
#SBATCH --job-name=pytorch_mnist     # job name
#SBATCH --ntasks=1                   # total number of MPI tasks
#SBATCH --ntasks-per-node=1          # number of MPI tasks per node
#SBATCH --gres=gpu:1                 # number of GPUs per node
#SBATCH --cpus-per-task=10           # number of cores per task
#SBATCH --hint=nomultithread         # we get physical cores, not logical ones
#SBATCH --distribution=block:block   # we pin the tasks on contiguous cores
#SBATCH --time=3:00:00               # maximum execution time (HH:MM:SS)
#SBATCH --output=pytorch_mnist%j.out # output file name
#SBATCH --error=pytorch_mnist%j.err  # error file name

set -x
cd ${SLURM_SUBMIT_DIR}
export WANDB_MODE="offline"
module purge
module load pytorch-gpu/py3/1.11.0

python ./mnist_example.py 

barthelemymp commented on June 16, 2024

If I try what I just proposed, I get this in the stderr of my script:

DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command
WARNING: Syncing not active or too slow: Command /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command file still exists
DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command
DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command
DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command
DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command
DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command
DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command
DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command
DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command
DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command
DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command

And in the tmux session where wandb-osh is running, I have:

INFO: Starting to watch /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.

klieret commented on June 16, 2024

Yes, that's the correct procedure. The first Syncing not active or too slow is to be expected (and doesn't matter at all), because you start wandb-osh after two epochs have already been completed.

The real question is why wandb sync is showing No runs to be synced.

I actually think this is a bug in wandb-osh: It seems to set up wandb/offline-run-20230118_170057-2dyqzdo6/files for syncing, rather than just wandb/offline-run-20230118_170057-2dyqzdo6/
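As an illustration of the difference between the two paths, here is a small sketch that maps a path ending in /files back to the run directory that wandb sync expects. The helper name run_dir_for_sync is hypothetical and this is not the code of the actual fix.

```python
from pathlib import PurePosixPath


def run_dir_for_sync(path: str) -> str:
    """Return the offline run directory, dropping a trailing 'files' component.

    Hypothetical helper for illustration only.
    """
    p = PurePosixPath(path)
    return str(p.parent) if p.name == "files" else str(p)


print(run_dir_for_sync("wandb/offline-run-20230118_170057-2dyqzdo6/files"))
```

A path that already points at the run directory is returned unchanged (apart from a trailing slash, which PurePosixPath drops).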

I've always tested with ray tune, so that's why I might not have been aware of this.

I will fix this in the next two hours and then let you know. I'd be super happy if you could test again then.

barthelemymp commented on June 16, 2024

Thanks :) I'll do that.

barthelemymp commented on June 16, 2024

I managed to make it work with
wandb-osh -- --include-offline /gpfswork/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-*
I don't know if that helps.

klieret commented on June 16, 2024

Yes, that fixes the bug with the wrong run directories that were assumed by wandb_osh. I've now fixed that in v1.0.3.

Could you test my fix by updating the package (pip3 install --upgrade wandb_osh) and then simply running wandb-osh (no other arguments required)?

klieret commented on June 16, 2024

@all-contributors please add @barthelemymp for bug

allcontributors commented on June 16, 2024

@klieret

I've put up a pull request to add @barthelemymp! 🎉

barthelemymp commented on June 16, 2024

Nope, I still get:

INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_193002-fnmizw5d/files...
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_193002-fnmizw5d/files...
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_193002-fnmizw5d/files...
wandb: No runs to be synced.

When installing, I had to add the path by hand. Do you have a command to check that the wandb-osh I call is the updated one?

klieret commented on June 16, 2024

Can you check python3 -m pip freeze | grep wandb-osh for the version?
Because this still looks like it's using the old version...

klieret commented on June 16, 2024

Alternatively, you can do

import wandb_osh
print(wandb_osh.__version__)
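A variant of the same check, sketched with only the standard library: it also prints which interpreter is running, which helps when the head node and the batch job may use different Python environments. The helper name installed_version is made up for this example.

```python
import sys
from importlib import metadata


def installed_version(distribution: str):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Run this once on the head node and once inside the batch job and compare.
print("interpreter:", sys.executable)
print("wandb-osh:", installed_version("wandb-osh"))
```

If the two environments print different interpreters or versions, the hook in the job and the wandb-osh executable on the head node are probably not from the same installation.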

barthelemymp commented on June 16, 2024

wandb-osh==1.0.3: I have the right version, and the problem remains.

Tell me if I can run some more tests on my side.

Best, Barthelemy

klieret commented on June 16, 2024

Just double checking: It's also updated in the python you use in the batch scripts, right? (just in case you use some conda env there, etc.). The fix was related to the hook that is included in the python package, not the wandb-osh executable.

Because I cannot believe that it still points to the paths that end in /files with the new version...

You could also do

python -m pip install --upgrade --force-reinstall 'wandb-osh@git+https://github.com/klieret/wandb-offline-sync-hook.git@main'

and then try again, as the newest version now prints the version number at startup.

klieret commented on June 16, 2024

If running your toy analysis is too much work, you can also try this simple snippet here:

#!/usr/bin/env python3

import wandb
import os
from wandb_osh.hooks import TriggerWandbSyncHook

sync_hook = TriggerWandbSyncHook()

os.environ['WANDB_SILENT'] = 'true'
os.environ["WANDB_MODE"] = "offline"
wandb.init()
wandb.log({"loss": 123})
sync_hook()

Run it and it should print something like

INFO: This is wandb-osh v1.0.3 using communication directory /Users/fuchur/.wandb_osh_command_dir
DEBUG: Wrote command file /Users/xxx/.wandb_osh_command_dir/1cf846.command

and if you do cat /Users/xxx/.wandb_osh_command_dir/1cf846.command (use the path from the debug message you just saw), it should show something like

/Users/xxx/Documents/23/git_sync/wandb-osh-tests/wandb/offline-run-20230118_155559-1rgh98sl

(note how it doesn't end in /files)

barthelemymp commented on June 16, 2024

So it is printing the right version:

INFO: wandb-osh v1.0.3, starting to watch /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_234022-59ewemor...
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_234022-59ewemor...
wandb: No runs to be synced.

Thank you for your commitment :)

klieret commented on June 16, 2024

Yes, now it points to the correct paths; that should work.

Are you running any training in parallel? Because if you synced manually or before, maybe there really is nothing to be synced.

Also, can you check in your script's output what wandb tells you to do for syncing: I usually see something like

wandb: Waiting for W&B process to finish... (success).
wandb:
wandb: Run history:
wandb: iterations_since_restore ▁▃▅▆█
wandb:            mean_accuracy ▁▄█▆▇
(...)
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/kl5675/ray_results/ray-tune-slurm-test/Trainable_f15a6ba8_8_conf_out_channels=9,lr=0.0013,momentum=0.1681_2023-01-18_18-36-34/wandb/offline-run-20230118_183635-f15a6ba8
wandb: Find logs at: ./wandb/offline-run-20230118_183635-f15a6ba8/logs
== Status ==

and the path after "You can sync this run" should be the same as the one we see in the output from wandb-osh.
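If comparing the two outputs by eye is tedious, the comparison could be scripted roughly as follows. This is only a sketch: the regular expressions assume the log line formats quoted in this thread, and both function names are made up for the example.

```python
import re

# Matches the line "wandb: wandb sync <path>" from the job output.
SYNC_HINT = re.compile(r"^wandb:[ \t]+wandb sync[ \t]+(\S.*?)[ \t]*$", re.MULTILINE)
# Matches the line "INFO: Syncing <path>..." from the wandb-osh output.
OSH_SYNC = re.compile(r"^INFO: Syncing (.+?)\.\.\.[ \t]*$", re.MULTILINE)


def suggested_sync_path(job_output: str):
    """Path that wandb itself suggests syncing, or None if no hint was found."""
    match = SYNC_HINT.search(job_output)
    return match.group(1) if match else None


def osh_sync_paths(osh_output: str):
    """All paths that wandb-osh reported syncing, in order of appearance."""
    return OSH_SYNC.findall(osh_output)


job_output = (
    "wandb: You can sync this run to the cloud by running:\n"
    "wandb: wandb sync /data/wandb/offline-run-1\n"
    "wandb: Find logs at: ./wandb/offline-run-1/logs\n"
)
osh_output = "INFO: Syncing /data/wandb/offline-run-1...   \nwandb: No runs to be synced.\n"

paths_agree = suggested_sync_path(job_output) in osh_sync_paths(osh_output)
```

The sample strings here are invented; in practice you would read the job's output file and the text from the wandb-osh terminal.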

barthelemymp commented on June 16, 2024

Here it is :

DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/422f4a.command
DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/422f4a.command
DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/422f4a.command
wandb: Waiting for W&B process to finish... (success).
wandb: 
wandb: Run history:
wandb: avg_a ▁▆▆▇▇▇▇▇█▇████
wandb: avg_e █▃▃▂▁▁▁▁▁▁▁▁▁▁
wandb: 
wandb: Run summary:
wandb: avg_a 9909
wandb: avg_e 0.02842
wandb: 
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230119_013623-fuzjfsll
wandb: Find logs at: ./wandb/offline-run-20230119_013623-fuzjfsll/logs

klieret commented on June 16, 2024

The path shown above has exactly the same structure as the paths shown in the output of wandb-osh itself... I really don't see how this shouldn't work...

If you had wandb-osh running in parallel, it probably also showed exactly the path

/gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230119_013623-fuzjfsll

right?
Because that means wandb-osh runs exactly the command that wandb suggests in the log...

klieret commented on June 16, 2024

Just another guess: Could it be that you have started another instance of wandb-osh in the background? Or something else that already syncs?

In either case, do you see the runs being synced to the wandb web interface? (Though that still wouldn't explain the "No runs to be synced" message.)

klieret commented on June 16, 2024

OK, I found one more thing: On my laptop wandb sync requires a path, even when in the right directory (else it will exactly show the 'no runs to be synced'), whereas on my cluster it doesn't. It's strange because it's the same version of wandb.
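For illustration, calling wandb sync with the run directory passed explicitly could look like the sketch below. The helper names are hypothetical, and this is only an assumption about what such a change might look like, not the code that went into the package.

```python
import shutil
import subprocess


def wandb_sync_command(run_dir: str) -> list[str]:
    """Build the command line with the run directory as an explicit argument."""
    return ["wandb", "sync", str(run_dir)]


def sync_run(run_dir: str) -> subprocess.CompletedProcess:
    """Run wandb sync for one offline run directory and capture its output."""
    if shutil.which("wandb") is None:
        raise FileNotFoundError("wandb executable not found on PATH")
    return subprocess.run(
        wandb_sync_command(run_dir),
        check=False,          # let the caller inspect the return code
        capture_output=True,
        text=True,
    )


print(wandb_sync_command("wandb/offline-run-20230119_013623-fuzjfsll"))
```

Passing the path explicitly avoids depending on the current working directory, which is the difference observed between the laptop and the cluster above.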

But let me change that in the package real quick.

klieret commented on June 16, 2024

OK. Could you try

python3 -m pip install --upgrade wandb-osh

and try one last time? The version should then be 1.0.4.

I'm very sorry to use you as a beta tester here ;) But I'm absolutely confident that it will work now :)

klieret commented on June 16, 2024

Awesome! Thank you so much again :)
