
Wandb Offline Sync Hook

A convenient way to trigger synchronizations to wandb if your compute nodes don't have internet!

(Badges: documentation status, PyPI version, Python 3.8‒3.11, PRs welcome, pre-commit.ci status, test workflow, link checker, codecov, gitmoji, Black)

🤔 What is this?

  • ✅ You use wandb/Weights & Biases to record your machine learning trials?
  • ✅ Your ML experiments run on compute nodes without internet access (for example, using a batch system)?
  • ✅ Your compute nodes and your head/login node (with internet) have access to a shared file system?

Then this package might be useful. For alternatives, see below.

What you might have been doing so far

So far you have probably been setting export WANDB_MODE="offline" on the compute nodes and then running something like

cd /.../result_dir/
for d in $(ls -t -d */); do cd $d; wandb sync --sync-all; cd ..; done

from your head node (with internet access) every now and then. Obviously this is not very satisfying, as it doesn't update live. Sure, you could wrap this in a while True loop, but if you have many trials in your directory, each pass takes forever, causes unnecessary network traffic, and it's just not very elegant.

How does wandb-osh solve the problem?

  1. You add a hook that is called every time an epoch concludes (that is, when we want to trigger a sync).
  2. You start the wandb-osh script on your head node with internet access. This script then triggers wandb sync upon request from one of the compute nodes.

How is this implemented?

Very simple: Every time an epoch concludes, the hook gets called and creates a file in the communication directory (~/.wandb_osh_communication by default). The wandb-osh script that is running on the head node (with internet) reads these files and performs the synchronization.

What alternatives are there?

With ray tune, you can use your ray head node as the place to synchronize from (rather than deploying it via the batch system as well, as the current docs suggest). See the note below or my demo repository. Similar strategies might be possible for wandb as well (let me know!).

📦 Installation

pip3 install wandb-osh

For completeness, the extras lightning and ray are provided, but they only ensure that the corresponding package is installed. For example,

pip3 install 'wandb-osh[lightning]'

also installs pytorch lightning if it is not already present, but has no other effect.

For development, make sure to also include the testing extra:

pip3 install --editable '.[testing]'

🔥 Running it!

Two steps: Set up the hook, then run the script from your head node.

Step 1: Setting up the hook

With pure wandb

Let's adapt the simple PyTorch example from the wandb docs (it only takes 3 new lines!):

import wandb
import torch.nn.functional as F  # needed for F.nll_loss below
from wandb_osh.hooks import TriggerWandbSyncHook  # <-- New!


trigger_sync = TriggerWandbSyncHook()  # <--- New!

wandb.init(config=args, mode="offline")

model = ...  # set up your model
optimizer = ...  # set up your optimizer

# Magic
wandb.watch(model, log_freq=100)

model.train()
for batch_idx, (data, target) in enumerate(train_loader):
    output = model(data)
    loss = F.nll_loss(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if batch_idx % args.log_interval == 0:
        wandb.log({"loss": loss})
        trigger_sync()  # <-- New!

With pytorch lightning

Simply add the TriggerWandbSyncLightningCallback to your list of callbacks and you're good to go!

from wandb_osh.lightning_hooks import TriggerWandbSyncLightningCallback  # <-- New!
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning import Trainer

logger = WandbLogger(
    project="project",
    group="group",
    offline=True,
)

model = MyLightningModule()
trainer = Trainer(
    logger=logger,
    callbacks=[TriggerWandbSyncLightningCallback()]  # <-- New!
)
trainer.fit(model, train_dataloader, val_dataloader)

With ray tune

Note With ray tune, you might not need this package! While the approach suggested in the ray tune SLURM docs deploys the ray head on a worker node as well (so it doesn't have internet), this actually isn't needed. Instead, you can run the ray head and the tuning script on the head node and only submit batch jobs for your workers. In this way, wandb will be called from the head node and internet access is no problem there. For more information on this approach, take a look at my demo repository.

You probably already use the WandbLoggerCallback. We simply add a second callback for wandb-osh (it only takes two new lines!):

import os

from ray import tune
from ray.air import RunConfig  # ray.train.RunConfig in newer Ray versions
from ray.air.integrations.wandb import WandbLoggerCallback
from wandb_osh.ray_hooks import TriggerWandbSyncRayHook  # <-- New!


os.environ["WANDB_MODE"] = "offline"

callbacks = [
    WandbLoggerCallback(...),  # <-- ray tune documentation tells you about this
    TriggerWandbSyncRayHook(),  # <-- New!
]

tuner = tune.Tuner(
    trainable,
    tune_config=...,
    run_config=RunConfig(
        ...,
        callbacks=callbacks,
    ),
)

With anything else

Simply take the TriggerWandbSyncHook class and use it as a callback in your training loop (as in the wandb example above), passing the directory that wandb is syncing to as an argument.
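In a generic training loop this amounts to calling the hook once per epoch. A minimal sketch follows; the trigger here is a stand-in that just records its calls, so the example runs without wandb installed. With wandb_osh you would use TriggerWandbSyncHook() in its place:

```python
calls = []


def trigger_sync(logdir=None):
    """Stand-in for TriggerWandbSyncHook(); records each invocation."""
    calls.append(logdir)


def train(n_epochs, on_epoch_end, logdir="./wandb/latest-run"):
    """Generic loop with an epoch-end callback slot."""
    for epoch in range(n_epochs):
        # ... run one epoch, log metrics with wandb.log(...) ...
        on_epoch_end(logdir)  # pass the directory that wandb syncs to
    return n_epochs


train(3, trigger_sync)
```

The only integration point is the one callback invocation at the end of each epoch.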

Step 2: Running the script on the head node

After installation, you should have a wandb-osh script in your $PATH. Simply call it like this:

wandb-osh

The output will look something like this:

INFO: Starting to watch /home/kl5675/.wandb_osh_command_dir
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/b1f60706 ... done.
INFO: Finished syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/92a3ef1b ... done.
INFO: Finished syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_a2caa9c0_2_attr_pt_thld=0.0092,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-17
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_a2caa9c0_2_attr_pt_thld=0.0092,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-17/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/a2caa9c0 ... done.

Take a look at wandb-osh --help or check the documentation for all command line options. You can add options to the wandb sync call by placing them after --. For example

wandb-osh -- --sync-all

❓ Q & A

I get the warning "wandb: NOTE: use wandb sync --sync-all to sync 1 unsynced runs from local directory."

You can start wandb-osh with wandb-osh -- --sync-all to always synchronize all available runs.

How can I suppress logging messages (e.g., warnings about syncing not being fast enough)?

import wandb_osh

# for wandb_osh.__version__ >= 1.2.0
wandb_osh.set_log_level("ERROR")

🧰 Development setup

pip3 install pre-commit
pre-commit install

💖 Contributing

Your help is greatly appreciated! Suggestions, bug reports, and feature requests are best opened as GitHub issues. You are also very welcome to submit a pull request!

Bug reports and pull requests are credited with the help of the allcontributors bot.

  • Barthelemy Meynard-Piganeau (🐛)
  • MoH-assan (🐛)
  • Cedric Leonard (💻 🐛)


wandb-offline-sync-hook's Issues

W&B config synced only at the end of the run

Hi! This is more a question than an issue.

Problem

I am running quite long sweeps offline, and I realized that the W&B config is uploaded only at the end of the run (when mode="offline"). Worse, if the run crashes because of the job time limit or some other event, the config is never uploaded, which makes the run pretty useless.

Context

I have seen the issue you opened #6974 as well as this feature recommendation #6952, but I don't see a clean solution (IMO, using Artifacts or Text files completely defeats the purpose).

Setup

I tried forcing wandb.config at the beginning of the run (instead of waiting for the Lightning Trainer to initialize it), but it did not seem to change anything; the problem seems to come from the synchronization.
Here is how I do it (using a hydra config):

wandb.config = OmegaConf.to_container(cfg, resolve=True, throw_on_missing=True)
wandb.init(entity=cfg.entity, project=cfg.project, ...)

Do you have a current workaround? Or a glimpse of a solution?

Also, thanks for your amazing work, this is a super handy package and by far the simplest solution I have found 😁

Wandb-osh cannot handle many runs at once

Hello,

We are a wandb-osh power user. First of all, thank you for making this excellent utility.
We frequently run 20-30 runs, all logging to wandb, simultaneously. What ends up happening is that the runs which log more frequently crowd out the runs that log less frequently. Therefore, slow runs rarely update to wandb!

If wandb-osh could handle command files in a first-in-first-out fashion rather than "last written", this would fix the issue.

Thank you again for making this. I will attempt to make a PR when I get some time, unless you get to it first.

`pytorch_lightning` mixed with `lightning.pytorch`: Getting a "ValueError('Expected a parent')" when using a list of Callbacks

Context

I see that you are using the old version of Pytorch Lightning in the lightning hooks:

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import Callback

There was some change in lightning a while ago: they renamed the package from pytorch_lightning to lightning. This actually made quite a mess (you can check here for more details).

Error

Since that change, mixing pytorch_lightning with lightning.pytorch causes an error, for example when creating a list of callbacks where some depend on pytorch_lightning.Callback and others on lightning.pytorch.Callback.
This is what happened to me when I tried to add TriggerWandbSyncLightningCallback to my Trainer.
Here is the generated error:

hydra.errors.InstantiationException: Error in call to target 'lightning.pytorch.trainer.trainer.Trainer':
ValueError('Expected a parent')
full_key: trainer

(I am using hydra and several libraries that made this change.)

You can find more details in issue #17485.
Also, you can see a similar update/fix in Optuna's PR #5028.

Fix

I think simply replacing:

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import Callback

by

from lightning.pytorch import LightningModule, Trainer
from lightning.pytorch.callbacks import Callback

would do the trick

I will try opening a PR to fix that.
Thanks for your work and have a nice day!

Wandb Config Updates are not synced

Hi,

When I initialize wandb with a specific configuration (e.g. config_global in the code snippet below) and then update that configuration and trigger a sync, the updates are not reflected in the online run.
See the code and image below.

Is there is a way around this?

Thanks,
Mohamed

import os
os.environ["WANDB_MODE"] = "offline"
os.environ["WANDB_DIR"] = wandb_dir
import wandb
from wandb_osh.hooks import TriggerWandbSyncHook

config_global = {}
config_global["dummy1"] = 0
wandb_run = wandb.init(config=config_global, project="SM_H_S")
wandb.config["dummy2"] = 0
TriggerWandbSyncHook(communication_dir=wandb_ohs_dir)()

(image omitted)

Timeout option from the CL is ignored

Hi,
I am wondering: is wandb-osh --timeout -1 working?

When I try to set it to a negative or an extremely large positive value, it makes no difference; I still get the timeout warning.
Update:

command_dir=args.command_dir, wait=args.wait, wandb_options=args.wandb_options

I guess the call to WandbSyncer is missing the timeout, so the default is used (i.e., 120 s).
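If that diagnosis is right, the fix would be to forward the parsed value. A sketch with a stand-in follows; WandbSyncer and the argument names are taken from the quoted call and are assumptions, the real code would pass these to WandbSyncer itself:

```python
from types import SimpleNamespace


def make_syncer(args):
    """Stand-in showing the forwarded kwarg; in the real code these would
    be passed to WandbSyncer(...)."""
    return SimpleNamespace(
        command_dir=args.command_dir,
        wait=args.wait,
        wandb_options=args.wandb_options,
        timeout=args.timeout,  # previously missing, so the 120 s default applied
    )


args = SimpleNamespace(command_dir=".", wait=1, wandb_options=[], timeout=-1)
syncer = make_syncer(args)
```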

Thanks

Runs listed as "Finished" in the WandB portal while still running

Hey—thanks for the super-helpful tool!

I have a question, and maybe I've just overlooked something, but is there a way to make runs that are still ongoing be listed as "Running" in the portal while using this tool? The default is for the runs to be shown as "Finished" (see image) even though they're currently running.

I'm running an asynchronous evaluation tool that automatically picks up "Finished" runs, so these half-trained models are currently evaluated prematurely.

(screenshot omitted)

Thank you—best,
Lars

Auto refresh in the web UI

Can you get wandb's auto-refresh to work when the runs are uploaded by wandb sync? I found that wandb tags them as finished, and the auto-refresh just doesn't work.

Is this incremental sync?

Hi developers,

Thanks for this fantastic tool. I am wondering whether wandb-osh does incremental sync. In my case it stops working after several syncs and prompts something like "timed out, try later". Thank you.

Sway

Syncing doesn't work with wandb

Hello,

First thank you for creating this tool!
Unfortunately, I haven't managed to make it work.
I get this error each time I use trigger_sync:

DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command
WARNING: Syncing not active or too slow: Command /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command file still exists
(these two lines repeat on every trigger)

I am not sure where it comes from... any idea?

best

b

Change logging level

First of all, thank you for writing this library. This has saved us a ton of pain with compute nodes that have no internet connectivity.

Is there any way by which we can disable or lower the amount of logging? Particularly, our Slurm stderr files are filled with this on every other line:

12:03:11 WARNING: Syncing not active or too slow: Command /p/home/ritwik/.wandb_osh_command_dir/27111f.command file still exists
WARNING:wandb_osh:Syncing not active or too slow: Command /p/home/ritwik/.wandb_osh_command_dir/27111f.command file still exists
12:03:11 DEBUG: Wrote command file /p/home/ritwik/.wandb_osh_command_dir/27111f.command
DEBUG:wandb_osh:Wrote command file /p/home/ritwik/.wandb_osh_command_dir/27111f.command

It would be great if we could turn this off.
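The Q & A section above points to wandb_osh.set_log_level (available from version 1.2.0). The duplicated log lines in this report also show that the logger is named "wandb_osh", so the standard library can silence it as well. A sketch, assuming that logger name:

```python
import logging

# Raise the wandb-osh logger's threshold so DEBUG/WARNING chatter is
# dropped and only errors still reach stderr.
logging.getLogger("wandb_osh").setLevel(logging.ERROR)
```

This works with any wandb_osh version because it only touches the stdlib logging configuration.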
