
Wandb Offline Sync Hook

A convenient way to trigger synchronizations to wandb if your compute nodes don't have internet!

(Badges: documentation status, PyPI version, Python 3.8‒3.11, PRs welcome, pre-commit.ci status, test workflow, link checker, codecov, gitmoji, Black)

🤔 What is this?

  • ✅ You use wandb/Weights & Biases to record your machine learning trials?
  • ✅ Your ML experiments run on compute nodes without internet access (for example, using a batch system)?
  • ✅ Your compute nodes and your head/login node (with internet) have access to a shared file system?

Then this package might be useful. For alternatives, see below.

What you might have been doing so far

So far you have probably been setting export WANDB_MODE="offline" on the compute nodes and then running something like

cd /.../result_dir/
for d in $(ls -t -d */); do cd $d; wandb sync --sync-all; cd ..; done

from your head node (with internet access) every now and then. Obviously this is not very satisfying, as it doesn't update live. Sure, you could wrap this in a while True loop, but if you have many trials in your directory, each pass takes forever, causes unnecessary network traffic, and it's just not very elegant.

How does wandb-osh solve the problem?

  1. You add a hook that is called every time an epoch concludes (that is, when we want to trigger a sync).
  2. You start the wandb-osh script on your head node with internet access. This script then triggers wandb sync upon request from one of the compute nodes.

How is this implemented?

Very simple: Every time an epoch concludes, the hook gets called and creates a file in the communication directory (~/.wandb_osh_communication by default). The wandb-osh script that is running on the head node (with internet) reads these files and performs the synchronization.

What alternatives are there?

With ray tune, you can use your ray head node as the place to synchronize from (rather than deploying it via the batch system as well, as the current docs suggest). See the note below or my demo repository. Similar strategies might be possible for wandb as well (let me know!).

📦 Installation

pip3 install wandb-osh

For completeness, the extras lightning and ray are provided, but they only ensure that the corresponding package is installed. For example,

pip3 install 'wandb-osh[lightning]'

also installs pytorch lightning if it is not already present, but has no other effect.

For development, make sure to also include the testing extra:

pip3 install --editable '.[testing]'

🔥 Running it!

Two steps: Set up the hook, then run the script from your head node.

Step 1: Setting up the hook

With pure wandb

Let's adapt the simple PyTorch example from the wandb docs (it only takes 3 new lines!):

import wandb
import torch.nn.functional as F  # needed for F.nll_loss below
from wandb_osh.hooks import TriggerWandbSyncHook  # <-- New!


trigger_sync = TriggerWandbSyncHook()  # <--- New!

wandb.init(config=args, mode="offline")

model = ...  # set up your model
optimizer = ...  # set up your optimizer

# Magic
wandb.watch(model, log_freq=100)

model.train()
for batch_idx, (data, target) in enumerate(train_loader):
    output = model(data)
    loss = F.nll_loss(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if batch_idx % args.log_interval == 0:
        wandb.log({"loss": loss})
        trigger_sync()  # <-- New!

With pytorch lightning

Simply add the TriggerWandbSyncLightningCallback to your list of callbacks and you're good to go!

from wandb_osh.lightning_hooks import TriggerWandbSyncLightningCallback  # <-- New!
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning import Trainer

logger = WandbLogger(
    project="project",
    group="group",
    offline=True,
)

model = MyLightningModule()
trainer = Trainer(
    logger=logger,
    callbacks=[TriggerWandbSyncLightningCallback()]  # <-- New!
)
trainer.fit(model, train_dataloader, val_dataloader)

With ray tune

Note With ray tune, you might not need this package! While the approach suggested in the ray tune SLURM docs deploys the ray head on a worker node as well (so it doesn't have internet), this actually isn't needed. Instead, you can run the ray head and the tuning script on the head node and only submit batch jobs for your workers. In this way, wandb will be called from the head node and internet access is no problem there. For more information on this approach, take a look at my demo repository.

You probably already use the WandbLoggerCallback. We simply add a second callback for wandb-osh (it only takes two new lines!):

import os

from ray import tune
from ray.air import RunConfig  # ray.train.RunConfig in newer Ray versions
from ray.air.integrations.wandb import WandbLoggerCallback
from wandb_osh.ray_hooks import TriggerWandbSyncRayHook  # <-- New!


os.environ["WANDB_MODE"] = "offline"

callbacks = [
    WandbLoggerCallback(...),  # <-- ray tune documentation tells you about this
    TriggerWandbSyncRayHook(),  # <-- New!
]

tuner = tune.Tuner(
    trainable,
    tune_config=...,
    run_config=RunConfig(
        ...,
        callbacks=callbacks,
    ),
)

With anything else

Simply take the TriggerWandbSyncHook class and use it as a callback in your training loop (as in the wandb example above), passing the directory that wandb is syncing to as an argument.
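In a generic training loop this amounts to calling the hook once per epoch. A minimal sketch follows; the trigger here is a stand-in that just records its calls, so the example runs without wandb installed. With wandb_osh you would use TriggerWandbSyncHook() in its place:

```python
calls = []


def trigger_sync(logdir=None):
    """Stand-in for TriggerWandbSyncHook(); records each invocation."""
    calls.append(logdir)


def train(n_epochs, on_epoch_end, logdir="./wandb/latest-run"):
    """Generic loop with an epoch-end callback slot."""
    for epoch in range(n_epochs):
        # ... run one epoch, log metrics with wandb.log(...) ...
        on_epoch_end(logdir)  # pass the directory that wandb syncs to
    return n_epochs


train(3, trigger_sync)
```

The only integration point is the one callback invocation at the end of each epoch.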

Step 2: Running the script on the head node

After installation, you should have a wandb-osh script in your $PATH. Simply call it like this:

wandb-osh

The output will look something like this:

INFO: Starting to watch /home/kl5675/.wandb_osh_command_dir
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/b1f60706 ... done.
INFO: Finished syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/92a3ef1b ... done.
INFO: Finished syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_a2caa9c0_2_attr_pt_thld=0.0092,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-17
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_a2caa9c0_2_attr_pt_thld=0.0092,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-17/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/a2caa9c0 ... done.

Take a look at wandb-osh --help or check the documentation for all command line options. You can add options to the wandb sync call by placing them after --. For example

wandb-osh -- --sync-all

❓ Q & A

I get the warning "wandb: NOTE: use wandb sync --sync-all to sync 1 unsynced runs from local directory."

You can start wandb-osh with wandb-osh -- --sync-all to always synchronize all available runs.

How can I suppress logging messages (e.g., warnings about syncing not being fast enough)?

import wandb_osh

# for wandb_osh.__version__ >= 1.2.0
wandb_osh.set_log_level("ERROR")

🧰 Development setup

pip3 install pre-commit
pre-commit install

💖 Contributing

Your help is greatly appreciated! Suggestions, bug reports, and feature requests are best opened as GitHub issues. You are also very welcome to submit a pull request!

Bug reports and pull requests are credited with the help of the allcontributors bot.

  • Barthelemy Meynard-Piganeau (🐛)
  • MoH-assan (🐛)
  • Cedric Leonard (💻 🐛)


wandb-offline-sync-hook's Issues

W&B config synced only at the end of the run

Hi! This is more a question than an issue.

Problem

I am running quite long sweeps offline, and I realized that the W&B config is uploaded only at the end of the run (when mode="offline"). Worse, if the run crashes because of the job time limit or some other event, the config is never uploaded, which makes the run pretty useless.

Context

I have seen the issue you opened #6974 as well as this feature recommendation #6952, but I don't see a clean solution (IMO, using Artifacts or Text files completely defeats the purpose).

Setup

I tried forcing wandb.config at the beginning of the run (instead of waiting for the Lightning Trainer to initialize it), but it did not seem to change anything; the problem seems to come from the synchronization.
Here is how I do it (using a hydra config):

wandb.config = OmegaConf.to_container(cfg, resolve=True, throw_on_missing=True)
wandb.init(entity=cfg.entity, project=cfg.project, ...)

Do you have a current workaround? Or a glimpse of a solution?

Also, thanks for your amazing work, this is a super handy package and by far the simplest solution I have found 😁

Wandb-osh cannot handle many runs at once

Hello,

We are a wandb-osh power user. First of all, thank you for making this excellent utility.
We frequently run 20-30 runs, all logging to wandb, simultaneously. What ends up happening is that the runs which log more frequently crowd out the runs that log less frequently. Therefore, slow runs rarely update to wandb!

If wandb-osh could handle command files in a first-in-first-out fashion rather than "last written", this would fix the issue.

Thank you again for making this. I will attempt to make a PR when I get some time, unless you get to it first.

`pytorch_lightning` mixed with `lightning.pytorch`: Getting a "ValueError('Expected a parent')" when using a list of Callbacks

Context

I see that you are using the old version of Pytorch Lightning in the lightning hooks:

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import Callback

There was some change in lightning a while ago: they renamed the package from pytorch_lightning to lightning. This actually made quite a mess (you can check here for more details).

Error

Since that change, mixing pytorch_lightning with lightning.pytorch causes an error, for example when creating a list of callbacks where some depend on pytorch_lightning.Callback and others on lightning.pytorch.Callback.
This is what happened to me when I tried to add TriggerWandbSyncLightningCallback to my Trainer.
Here is the generated error:

hydra.errors.InstantiationException: Error in call to target 'lightning.pytorch.trainer.trainer.Trainer':
ValueError('Expected a parent')
full_key: trainer

(I am using hydra and several libraries that made this change.)

You can find more details in issue #17485.
Also, you can see a similar update/fix in Optuna's PR #5028.

Fix

I think simply replacing:

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import Callback

by

from lightning.pytorch import LightningModule, Trainer
from lightning.pytorch.callbacks import Callback

would do the trick

I will try opening a PR to fix that.
Thanks for your work and have a nice day!

Wandb Config Updates are not synced

Hi,

When I initialize wandb with a specific configuration (e.g. config_global in the code snippet below) and then update that configuration and trigger a sync, the updates are not reflected in the online run.
See the code and image below.

Is there is a way around this?

Thanks,
Mohamed

import os
os.environ["WANDB_MODE"] = "offline"
os.environ["WANDB_DIR"] = wandb_dir
import wandb
from wandb_osh.hooks import TriggerWandbSyncHook

config_global = {}
config_global["dummy1"] = 0
wandb_run = wandb.init(config=config_global, project="SM_H_S")
wandb.config["dummy2"] = 0
TriggerWandbSyncHook(communication_dir=wandb_ohs_dir)()

(image omitted)

Timeout option from the CL is ignored

Hi,
I am wondering: is wandb-osh --timeout -1 working?

When I try to set it to a negative or an extremely large positive value, it makes no difference; I still get the timeout warning.
Update:

command_dir=args.command_dir, wait=args.wait, wandb_options=args.wandb_options

I guess the call to WandbSyncer is missing the timeout, so the default is used (i.e., 120 s).
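If that diagnosis is right, the fix would be to forward the parsed value. A sketch with a stand-in follows; WandbSyncer and the argument names are taken from the quoted call and are assumptions, the real code would pass these to WandbSyncer itself:

```python
from types import SimpleNamespace


def make_syncer(args):
    """Stand-in showing the forwarded kwarg; in the real code these would
    be passed to WandbSyncer(...)."""
    return SimpleNamespace(
        command_dir=args.command_dir,
        wait=args.wait,
        wandb_options=args.wandb_options,
        timeout=args.timeout,  # previously missing, so the 120 s default applied
    )


args = SimpleNamespace(command_dir=".", wait=1, wandb_options=[], timeout=-1)
syncer = make_syncer(args)
```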

Thanks

Runs listed as "Finished" in the WandB portal while still running

Hey—thanks for the super-helpful tool!

I have a question, and maybe I've just overlooked something, but is there a way to make runs that are still ongoing be listed as "Running" in the portal while using this tool? The default is for the runs to be shown as "Finished" (see image) even though they're currently running.

I'm running an asynchronous evaluation tool that automatically picks up "Finished" runs, so these half-trained models are currently evaluated prematurely.

(screenshot omitted)

Thank you—best,
Lars

Auto refresh in the web UI

Can you get wandb's auto-refresh to work when the runs are uploaded by wandb sync? I found that wandb tags them as finished, and the auto-refresh just doesn't work.

Is this incremental sync?

Hi developers,

Thanks for this fantastic tool. I am wondering whether wandb-osh does incremental sync. In my case it stops working after several syncs and prompts something like "timed out, try later". Thank you.

Sway

Syncing doesn't work with wandb

Hello,

First thank you for creating this tool!
Unfortunately, I haven't managed to make it work.
I get this error each time I use trigger_sync:

DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command
WARNING: Syncing not active or too slow: Command /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command file still exists
(these two lines repeat on every trigger)

I am not sure where it comes from... any idea?

best

b

Change logging level

First of all, thank you for writing this library. This has saved us a ton of pain with compute nodes that have no internet connectivity.

Is there any way by which we can disable or lower the amount of logging? Particularly, our Slurm stderr files are filled with this on every other line:

12:03:11 WARNING: Syncing not active or too slow: Command /p/home/ritwik/.wandb_osh_command_dir/27111f.command file still exists
WARNING:wandb_osh:Syncing not active or too slow: Command /p/home/ritwik/.wandb_osh_command_dir/27111f.command file still exists
12:03:11 DEBUG: Wrote command file /p/home/ritwik/.wandb_osh_command_dir/27111f.command
DEBUG:wandb_osh:Wrote command file /p/home/ritwik/.wandb_osh_command_dir/27111f.command

It would be great if we could turn this off.
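The Q & A section above points to wandb_osh.set_log_level (available from version 1.2.0). The duplicated log lines in this report also show that the logger is named "wandb_osh", so the standard library can silence it as well. A sketch, assuming that logger name:

```python
import logging

# Raise the wandb-osh logger's threshold so DEBUG/WARNING chatter is
# dropped and only errors still reach stderr.
logging.getLogger("wandb_osh").setLevel(logging.ERROR)
```

This works with any wandb_osh version because it only touches the stdlib logging configuration.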
