
stable-audio-tools's Introduction

stable-audio-tools

Training and inference code for audio generation models

Install

The library can be installed from PyPI with:

$ pip install stable-audio-tools

To run the training scripts or inference code, you'll want to clone this repository, navigate to the root, and run:

$ pip install .

Requirements

Requires PyTorch 2.0 or later for Flash Attention support

Development for the repo is done in Python 3.8.10

Interface

A basic Gradio interface is provided to test out trained models.

For example, to create an interface for the stable-audio-open-1.0 model, once you've accepted the terms for the model on Hugging Face, you can run:

$ python3 ./run_gradio.py --pretrained-name stabilityai/stable-audio-open-1.0

The run_gradio.py script accepts the following command line arguments (an example invocation with a local model follows this list):

  • --pretrained-name
    • Hugging Face repository name for a Stable Audio Tools model
    • Will prioritize model.safetensors over model.ckpt in the repo
    • Optional, used in place of model-config and ckpt-path when using pre-trained model checkpoints on Hugging Face
  • --model-config
    • Path to the model config file for a local model
  • --ckpt-path
    • Path to unwrapped model checkpoint file for a local model
  • --pretransform-ckpt-path
    • Path to an unwrapped pretransform checkpoint, replaces the pretransform in the model, useful for testing out fine-tuned decoders
    • Optional
  • --share
    • If true, a publicly shareable link will be created for the Gradio demo
    • Optional
  • --username and --password
    • Used together to set a login for the Gradio demo
    • Optional
  • --model-half
    • If true, the model weights will be cast to half-precision
    • Optional
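
For example, to launch the interface for a local model, you might run something like the following (the config and checkpoint paths are placeholders for your own files):

$ python3 ./run_gradio.py --model-config /path/to/model/config.json --ckpt-path /path/to/unwrapped/model.ckpt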

Training

Prerequisites

Before starting your training run, you'll need a model config file as well as a dataset config file. For more information about those, refer to the Configurations section below.

The training code also requires a Weights & Biases account to log the training outputs and demos. Create an account and log in with:

$ wandb login

Start training

To start a training run, run the train.py script in the repo root with:

$ python3 ./train.py --dataset-config /path/to/dataset/config --model-config /path/to/model/config --name harmonai_train

The --name parameter will set the project name for your Weights & Biases run.

Training wrappers and model unwrapping

stable-audio-tools uses PyTorch Lightning to facilitate multi-GPU and multi-node training.

When a model is being trained, it is wrapped in a "training wrapper", which is a pl.LightningModule that contains all of the relevant objects needed only for training. That includes things like discriminators for autoencoders, EMA copies of models, and all of the optimizer states.

The checkpoint files created during training include this training wrapper, which greatly increases the size of the checkpoint file.

unwrap_model.py in the repo root will take in a wrapped model checkpoint and save a new checkpoint file including only the model itself.

That can be run from the repo root with:

$ python3 ./unwrap_model.py --model-config /path/to/model/config --ckpt-path /path/to/wrapped/ckpt --name model_unwrap

Unwrapped model checkpoints are required for:

  • Inference scripts
  • Using a model as a pretransform for another model (e.g. using an autoencoder model for latent diffusion)
  • Fine-tuning a pre-trained model with a modified configuration (i.e. partial initialization)

Fine-tuning

Fine-tuning a model involves continuing a training run from a pre-trained checkpoint.

To continue a training run from a wrapped model checkpoint, you can pass in the checkpoint path to train.py with the --ckpt-path flag.

To start a fresh training run using a pre-trained unwrapped model, you can pass in the unwrapped checkpoint to train.py with the --pretrained-ckpt-path flag.
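
For illustration, the two cases might look like this (all paths are placeholders):

$ python3 ./train.py --dataset-config /path/to/dataset/config --model-config /path/to/model/config --name harmonai_finetune --ckpt-path /path/to/wrapped/checkpoint.ckpt

$ python3 ./train.py --dataset-config /path/to/dataset/config --model-config /path/to/model/config --name harmonai_finetune --pretrained-ckpt-path /path/to/unwrapped/checkpoint.ckpt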

Additional training flags

Additional optional flags for train.py include (an example invocation follows this list):

  • --config-file
    • The path to the defaults.ini file in the repo root, required if running train.py from a directory other than the repo root
  • --pretransform-ckpt-path
    • Used in various model types such as latent diffusion models to load a pre-trained autoencoder. Requires an unwrapped model checkpoint.
  • --save-dir
    • The directory in which to save the model checkpoints
  • --checkpoint-every
    • The number of steps between saved checkpoints.
    • Default: 10000
  • --batch-size
    • Number of samples per-GPU during training. Should be set as large as your GPU VRAM will allow.
    • Default: 8
  • --num-gpus
    • Number of GPUs per-node to use for training
    • Default: 1
  • --num-nodes
    • Number of GPU nodes being used for training
    • Default: 1
  • --accum-batches
    • Enables gradient accumulation and sets the number of batches to accumulate. Useful for increasing the effective batch size when training on smaller GPUs.
  • --strategy
    • Multi-GPU strategy for distributed training. Setting to deepspeed will enable DeepSpeed ZeRO Stage 2.
    • Default: ddp if --num-gpus > 1, else None
  • --precision
    • Floating-point precision to use during training
    • Default: 16
  • --num-workers
    • Number of CPU workers used by the data loader
  • --seed
    • RNG seed for PyTorch, helps with deterministic training
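
As an illustration, a single-node multi-GPU run combining several of these flags might look like the following (all values are examples, not recommendations):

$ python3 ./train.py --dataset-config /path/to/dataset/config --model-config /path/to/model/config --name harmonai_train --save-dir /path/to/checkpoints --checkpoint-every 5000 --batch-size 4 --accum-batches 4 --num-gpus 4 --precision 16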

Configurations

Training and inference code for stable-audio-tools is based around JSON configuration files that define model hyperparameters, training settings, and information about your training dataset.

Model config

The model config file defines all of the information needed to load a model for training or inference. It also contains the training configuration needed to fine-tune a model or train from scratch.

The following properties are defined in the top level of the model configuration (a minimal top-level skeleton follows this list):

  • model_type
    • The type of model being defined, currently limited to one of "autoencoder", "diffusion_uncond", "diffusion_cond", "diffusion_cond_inpaint", "diffusion_autoencoder", "lm".
  • sample_size
    • The length of the audio provided to the model during training, in samples. For diffusion models, this is also the raw audio sample length used for inference.
  • sample_rate
    • The sample rate of the audio provided to the model during training, and generated during inference, in Hz.
  • audio_channels
    • The number of channels of audio provided to the model during training, and generated during inference. Defaults to 2. Set to 1 for mono.
  • model
    • The specific configuration for the model being defined, varies based on model_type
  • training
    • The training configuration for the model, varies based on model_type. Provides parameters for training as well as demos.
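
For orientation, a top-level skeleton might look like the following. The values are illustrative, taken from the example configs quoted in the issues below, and the model and training bodies are elided:

{
  "model_type": "diffusion_cond",
  "sample_size": 4194304,
  "sample_rate": 44100,
  "audio_channels": 2,
  "model": { ... },
  "training": { ... }
}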

Dataset config

stable-audio-tools currently supports two kinds of data sources: local directories of audio files, and WebDataset datasets stored in Amazon S3. A minimal local-directory sketch follows; more information can be found in the dataset config documentation.
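
As a minimal sketch of the local-directory case (the dataset id and path are placeholders; fuller examples appear in the issues below):

{
  "dataset_type": "audio_dir",
  "datasets": [
    {
      "id": "my_audio",
      "path": "/path/to/audio/files/"
    }
  ]
}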

Todo

  • Add troubleshooting section
  • Add contribution guidelines

stable-audio-tools's People

Contributors

akx, carlthome, eltociear, kadarakos, piwell, underdogest, zqevans


stable-audio-tools's Issues

If train.py runs with 'strategy=None' it throws a ValueError. Maybe 'auto' should be the default?

Running train.py with a single GPU and no --strategy parameter gives the error:

ValueError: You selected an invalid strategy name: `strategy=None`. It must be either a string or an instance of `pytorch_lightning.strategies.Strategy`. Example choices: auto, ddp, ddp_spawn, deepspeed, ... Find a complete list of options in our documentation at https://lightning.ai

It's the same whether --num-gpus is set to 1 or not set.

Running with --strategy SingleDeviceStrategy gives the error:

ValueError: You selected an invalid strategy name: `strategy='singledevicestrategy'`. It must be either a string or an instance of `pytorch_lightning.strategies.Strategy`. Example choices: auto, ddp, ddp_spawn, deepspeed, ... Find a complete list of options in our documentation at https://lightning.ai

Running with --strategy auto works.
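
In other words, the workaround reported above is to pass the strategy explicitly, for example (config paths are placeholders):

$ python3 ./train.py --dataset-config /path/to/dataset/config --model-config /path/to/model/config --name harmonai_train --strategy auto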

Convert ckpt to pt file for inference

Is it possible to use torch.jit.trace (or torch.jit.script) to save the model as a .pt ? I'm assuming due to the complexity torch.jit.script would be the way to go.

Multiple global conditioning values with a local dataset.

I am having issues with creating a config for using global conditioning values. I cannot piece this together from the examples provided and am not sure where the issue comes from; I suspect it is in my training config.

I have audio recordings tabulated with environmental data, and want to use 8 floating point parameters for global conditioning using the 'adp_cfg_1d' conditional diffusion model.

I have prepared metadata in a single JSON file and adjusted the custom metadata function to return a dictionary of all 8 parameters.

It would be great if anyone can shed some light on this.

Here is my config - I am using a pretrained autoencoder. I have added the conditioning details:

modelConfig = {
  "model_type": "diffusion_cond",
  "sample_size": 524288,
  "sample_rate": 44100,
  "audio_channels": 1,
  "model": {
    "io_channels": 64,
    "diffusion": {
      "type": "adp_cfg_1d",
      "supports_global_cond": True,
      "global_cond_ids": [
        "latitude",
        "longitude",
        "temperature",
        "humidity",
        "wind_speed",
        "pressure",
        "minutes_of_day",
        "day_of_year"
      ],
    
      "config": {
        "channels": 64,
        "context_embedding_max_length": 8,
        "context_embedding_features": 8,
        "in_channels": 64,
        "multipliers": [
          1,
          2,
          4,
          8,
          16
        ],
        "factors": [
          1,
          2,
          2,
          2
        ],
        "num_blocks": [
          1,
          2,
          2,
          2
        ],
        "attentions": [
          0,
          1,
          1,
          1,
          1
        ],
        "attention_heads": 8,
        "attention_multiplier": 2
      }
    },
    "pretransform": {
      "type": "dac_pretrained",
      "config": {}
    },
    "conditioning": {
      "configs": [
        {
          "id": "latitude",
          "type": "number",
          "config": {
            "min_val": 0.19656993333333334,
            "max_val": 50.443667
          }
        },
        {
          "id": "longitude",
          "type": "number",
          "config": {
            "min_val": 0.7689536111111112,
            "max_val": 0.9665608333333334
          }
        },
        {
          "id": "temperature",
          "type": "number",
          "config": {
            "min_val": -3.24085585279757,
            "max_val": 3.6700142339763984
          }
        },
        {
          "id": "humidity",
          "type": "number",
          "config": {
            "min_val": -3.8110683227566686,
            "max_val": 1.4696676807777593
          }
        },
        {
          "id": "wind_speed",
          "type": "number",
          "config": {
            "min_val": -1.6188376276678578,
            "max_val": 11.608313462165052
          }
        },
        {
          "id": "pressure",
          "type": "number",
          "config": {
            "min_val": -6.263994121817382,
            "max_val": 5.390325035103388
          }
        },
        {
          "id": "minutes_of_day",
          "type": "number",
          "config": {
            "min_val": 0,
            "max_val": 0.9993055555555556
          }
        },
        {
          "id": "day_of_year",
          "type": "number",
          "config": {
            "min_val": 0,
            "max_val": 1.0027472527472527
          }
        }
      ],
      "cond_dim": 8

    }
  },
  "training": {
    "learning_rate": 0.00004,
    "demo": {
      "demo_every": 1500,
      "demo_steps": 100,
      "num_demos": 3,
      "demo_cfg_scales": [
        1,
        1,
        1,
        1,
        1,
        1,  
        1,
        1
      ],
      "demo_cond": [
        {
          "latitude": 0.337113507707468,
          "longitude": 0.8998659802839629,
          "temperature": 2.0902047266579145e-16,
          "humidity": -5.249300186068055e-17,
          "wind_speed": -1.7127625984753777e-16,
          "pressure": -3.1158006270597913e-15,
          "minutes_of_day": 0.3977426990645676,
          "day_of_year": 0.5956888170306616
        },
        {
          "latitude": 0.38461336306221783,
          "longitude": 0.9281043879416349,
          "temperature": 1.0000000000000002,
          "humidity": 1,
          "wind_speed": 0.9999999999999998,
          "pressure": 0.9999999999999969,
          "minutes_of_day": 0.6036759423548516,
          "day_of_year": 0.8767792548963371
        },
        {
          "latitude": 0.2896136523527182,
          "longitude": 0.871627572626291,
          "temperature": -0.9999999999999998,
          "humidity": -1,
          "wind_speed": -1.0000000000000002,
          "pressure": -1.000000000000003,
          "minutes_of_day": 0.19180945577428368,
          "day_of_year": 0.31459837916498595
        }
      ]
    }
  }
}

Here is my custom metadata file:

import json

# Load the metadata from the JSON file into a list for quick access.
metadata_file_path = 'path/to/all_metadata.json'
with open(metadata_file_path, 'r') as file:
    audio_metadata_list = json.load(file) 

def get_custom_metadata(info, audio):
    # Extract the filename from the `info` parameter.
    audio_filename = info.get("relpath", "").split('/')[-1].replace('_P.wav', '')
    #find the entry in the JSON file.
    metadata_entry = next((item for item in audio_metadata_list if item["filename"] == audio_filename), None)

    # Default values for all keys.
    metadata_for_entry = {
        "latitude": 0.0,
        "longitude": 0.0,
        "temperature": 0.0,
        "humidity": 0.0,
        "wind_speed": 0.0,
        "pressure": 0.0,
        "minutes_of_day": 0.0,
        "day_of_year": 0.0,
    }

    if metadata_entry:
        for key in metadata_for_entry.keys():
            if key in metadata_entry:
                metadata_for_entry[key] = metadata_entry[key]

    return metadata_for_entry

I get the error below. The feedback and examples do not cover the required model config for conditional training like this, so it is difficult to debug.

File "./train.py", line 110, in main
    trainer.fit(training_wrapper, train_dl, ckpt_path=args.ckpt_path if args.ckpt_path else None)
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 581, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _run
    results = self._run_stage()
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1036, in _run_stage
    self.fit_loop.run()
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 240, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 187, in run
    self._optimizer_step(batch_idx, closure)
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 265, in _optimizer_step
    call._call_lightning_module_hook(
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 1282, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 151, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 230, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/amp.py", line 77, in optimizer_step
    closure_result = closure()
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 140, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 126, in closure
    step_output = self._step_fn()
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 315, in _training_step
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 382, in training_step
    return self.lightning_module.training_step(*args, **kwargs)
  File "/content/stable-audio-tools/stable_audio_tools/training/diffusion.py", line 406, in training_step
    v = self.diffusion(noised_inputs, t, cond=conditioning, cfg_dropout_prob = 0.1, **extra_args)
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/content/stable-audio-tools/stable_audio_tools/models/diffusion.py", line 180, in forward
    return self.model(x, t, **self.get_conditioning_inputs(cond), **kwargs)
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/stable-audio-tools/stable_audio_tools/models/diffusion.py", line 225, in forward
    outputs = self.model(
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/stable-audio-tools/stable_audio_tools/models/adp.py", line 1280, in forward
    b, device = embedding.shape[0], embedding.device
AttributeError: 'NoneType' object has no attribute 'shape'

how to use a trained model?

Hello,
I just trained a small test model, but I couldn't find a way to use it. How can I use a text prompt in Python to generate a .mp3 or .wav file with my model? With the "run_gradio.py" script, I could only upload other audio files, but not text prompts, and it also comes with a UI.
Thanks.
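
For reference, a minimal text-to-audio script along the lines of the public stabilityai/stable-audio-open-1.0 usage example looks roughly like this. It assumes the pretrained Hugging Face model, so for a locally trained model you would need to load your own model config and unwrapped checkpoint instead; the prompt, sampler settings, and output handling below are illustrative, not prescriptive:

import torch
import torchaudio
from einops import rearrange

from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the pretrained model (requires accepting the model terms on Hugging Face)
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]
model = model.to(device)

# Text prompt plus timing conditioning
conditioning = [{
    "prompt": "128 BPM tech house drum loop",
    "seconds_start": 0,
    "seconds_total": 30
}]

# Generate audio with the conditioned diffusion model
output = generate_diffusion_cond(
    model,
    steps=100,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=sample_size,
    sigma_min=0.3,
    sigma_max=500,
    sampler_type="dpmpp-3m-sde",
    device=device
)

# Collapse the batch dimension, peak-normalize, and write a 16-bit WAV file
output = rearrange(output, "b d n -> d (b n)")
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)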

Questions about Autoencoder's GAN loss

I have a couple of questions regarding the EncodecDiscriminator:
When calculating discriminative loss, should the output from the generator be detached?
I noticed that the generator hinge loss differs from that used in HiFi-GAN and DAC. Specifically, is using -score_fake.mean() better than torch.mean((1-score_fake)**2)?

here is the modified Discriminator:

import torch
import torch.nn as nn


class EncodecDiscriminator(nn.Module):

    def __init__(self, *args, **kwargs):
        super().__init__()

        from encodec.msstftd import MultiScaleSTFTDiscriminator

        self.discriminators = MultiScaleSTFTDiscriminator(*args, **kwargs)

    def forward(self, x):
        logits, features = self.discriminators(x)
        return logits, features

    def discriminator_hinge_loss(self, score_real, score_fake):
        dis_loss = torch.relu(1 - score_real).mean() + torch.relu(1 + score_fake).mean()
        return dis_loss

    def generator_hinge_loss(self, score_fake):
        gen_loss = -score_fake.mean()
        return gen_loss

    def loss(self, x, y):
        feature_matching_distance = 0.
        logits_true, feature_true = self.forward(x)
        logits_fake, feature_fake = self.forward(y)

        dis_loss = torch.tensor(0., device=x.device)
        adv_loss = torch.tensor(0., device=x.device)

        # Detach y for the discriminator loss computation
        logits_fake_detached, _ = self.forward(y.detach())

        for i, (scale_true, scale_fake) in enumerate(zip(feature_true, feature_fake)):

            feature_matching_distance = feature_matching_distance + sum(
                map(
                    lambda x, y: abs(x - y).mean(),
                    scale_true,
                    scale_fake,
                )) / len(scale_true)

            # Discriminator loss should use detached logits
            _dis = self.discriminator_hinge_loss(
                logits_true[i],
                logits_fake_detached[i],
            )
            dis_loss = dis_loss + _dis

            # Adversarial loss should use non-detached logits
            _adv = self.generator_hinge_loss(
                logits_fake[i],
            )
            adv_loss = adv_loss + _adv

        return dis_loss, adv_loss, feature_matching_distance

Init audio conditioning doesn't work with --model-half

When starting the gradio interface on Windows with --model-half, both init audio and audio inpaint result in an error regarding tensors and halftensors. I've properly followed the guidelines on the repo readme to install the library. Everything else works perfectly fine!

Here's the command used to start the interface:

python run_gradio.py --ckpt-path model.ckpt --model-config model_config.json --model-half

Here's the error log:

Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\gradio\queueing.py", line 521, in process_events
    response = await route_utils.call_process_api(
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\gradio\route_utils.py", line 276, in call_process_api
    output = await app.get_blocks().process_api(
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\gradio\blocks.py", line 1935, in process_api
    result = await self.call_function(
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\gradio\blocks.py", line 1513, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\anyio\to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\anyio\_backends\_asyncio.py", line 2177, in run_sync_in_worker_thread
    return await future
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\anyio\_backends\_asyncio.py", line 859, in run
    result = context.run(func, *args)
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\gradio\utils.py", line 832, in wrapper
    response = f(*args, **kwargs)
  File "C:\Users\User\Documents\AI\stable-audio-tools\stable_audio_tools\interface\gradio.py", line 167, in generate_cond
    audio = generate_diffusion_cond(
  File "C:\Users\User\Documents\AI\stable-audio-tools\stable_audio_tools\inference\generation.py", line 179, in generate_diffusion_cond
    init_audio = model.pretransform.encode(init_audio)
  File "C:\Users\User\Documents\AI\stable-audio-tools\stable_audio_tools\models\pretransforms.py", line 56, in encode
    encoded = self.model.encode_audio(x, chunked=self.chunked, iterate_batch=self.iterate_batch, **kwargs)
  File "C:\Users\User\Documents\AI\stable-audio-tools\stable_audio_tools\models\autoencoders.py", line 445, in encode_audio
    return self.encode(audio, **kwargs)
  File "C:\Users\User\Documents\AI\stable-audio-tools\stable_audio_tools\models\autoencoders.py", line 302, in encode
    latents.append(self.encoder(audio[i:i+1]))
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\User\Documents\AI\stable-audio-tools\stable_audio_tools\models\autoencoders.py", line 147, in forward
    return self.layers(x)
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\container.py", line 215, in forward
    input = module(input)
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\conv.py", line 310, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\conv.py", line 306, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same

`ImportError: cannot import name 'packaging' from 'pkg_resources'` from `run_gradio.py`

Traceback:

(venv) PS stable-audio-tools> python run_gradio.py --ckpt-path ./models/model.ckpt --model-config ./models/model_config.json
Traceback (most recent call last):
  File "run_gradio.py", line 2, in <module>
    from stable_audio_tools.interface.gradio import create_ui
  File "stable-audio-tools\stable_audio_tools\interface\gradio.py", line 14, in <module>
    from ..inference.generation import generate_diffusion_cond, generate_diffusion_uncond
  File "stable-audio-tools\stable_audio_tools\inference\generation.py", line 8, in <module>
    from .sampling import sample, sample_k
  File "stable-audio-tools\stable_audio_tools\inference\sampling.py", line 5, in <module>
    import k_diffusion as K
  File "stable-audio-tools\venv\lib\site-packages\k_diffusion\__init__.py", line 1, in <module>
    from . import augmentation, config, evaluation, external, gns, layers, models, sampling, utils
  File "stable-audio-tools\venv\lib\site-packages\k_diffusion\evaluation.py", line 6, in <module>
    import clip
  File "stable-audio-tools\venv\lib\site-packages\clip\__init__.py", line 1, in <module>
    from .clip import *
  File "stable-audio-tools\venv\lib\site-packages\clip\clip.py", line 6, in <module>
    from pkg_resources import packaging
ImportError: cannot import name 'packaging' from 'pkg_resources' (stable-audio-tools\venv\lib\site-packages\pkg_resources\__init__.py)

They're having the same issue over on AUTOMATIC1111/stable-diffusion-webui#15863

Something about the setuptools==70.0.0 release broke pkg_resources and it looks like there are no plans to fix it... Not sure where that leaves clip

The command that worked for me is
python -m pip install setuptools==69.5.1

(with the stable-audio-tools venv activated)

I can run python run_gradio.py plus arguments and it works after downgrading.

Installation not completing - botocore

Hi, I'm trying to install via pip install stable-audio-tools. This results in an extremely long-winded installation process in which an enormous number of different versions of botocore and aiobotocore are installed. I assume this is done by pip in order to resolve dependencies. I currently have not been able to complete the installation because these installs do not finish.

Just wanted to see if this problem is on my end, or if it might be an issue with the package installation.

I am on an M3 Pro-chip using Python 3.8.10 via pyenv.

Attaching my terminal output for reference.

debug.txt

ModuleNotFoundError: No module named 'packaging' from pip install stable-audio-tools

 Downloading flash_attn-2.5.9.post1.tar.gz (2.6 MB)
     ---------------------------------------- 2.6/2.6 MB 6.4 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [20 lines of output]
      Traceback (most recent call last):
        File "C:\StableAudio\venv\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 353, in <module>
          main()
        File "C:\StableAudio\venv\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "C:\StableAudio\venv\Lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "C:\Users\Krzysztof Jankowski\AppData\Local\Temp\pip-build-env-e_t_1lmm\overlay\Lib\site-packages\setuptools\build_meta.py", line 325, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "C:\Users\Krzysztof Jankowski\AppData\Local\Temp\pip-build-env-e_t_1lmm\overlay\Lib\site-packages\setuptools\build_meta.py", line 295, in _get_build_requires
          self.run_setup()
        File "C:\Users\Krzysztof Jankowski\AppData\Local\Temp\pip-build-env-e_t_1lmm\overlay\Lib\site-packages\setuptools\build_meta.py", line 487, in run_setup
          super().run_setup(setup_script=setup_script)
        File "C:\Users\Krzysztof Jankowski\AppData\Local\Temp\pip-build-env-e_t_1lmm\overlay\Lib\site-packages\setuptools\build_meta.py", line 311, in run_setup
          exec(code, locals())
        File "<string>", line 9, in <module>
      ModuleNotFoundError: No module named 'packaging'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Same error on:

  • Ubuntu
  • Windows 10 (miniconda3)

Both using venv.

PicklingError: Can't pickle <function get_custom_metadata at 0x00000216EE5ADDA0>: import of module 'metadata_module' failed

Hey,
when I try to train my model with custom audio and prompts via metadata, I just get this traceback:

PicklingError: Can't pickle <function get_custom_metadata at 0x00000216EE5ADDA0>: import of module 'metadata_module' failed
Traceback (most recent call last):
File "", line 1, in
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\multiprocessing\spawn.py", line 122, in spawn_main
exitcode = _main(fd, parent_sentinel)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\multiprocessing\spawn.py", line 132, in _main
self = reduction.pickle.load(from_parent)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input
Traceback (most recent call last):
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\trainer\call.py", line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 581, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 990, in _run
results = self._run_stage()
^^^^^^^^^^^^^^^^^
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1036, in _run_stage
self.fit_loop.run()
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 194, in run
self.setup_data()
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 258, in setup_data
iter(self._data_fetcher) # creates the iterator inside the fetcher
^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\loops\fetchers.py", line 99, in iter
super().iter()
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\loops\fetchers.py", line 48, in iter
self.iterator = iter(self.combined_loader)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\utilities\combined_loader.py", line 335, in iter
iter(iterator)
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\utilities\combined_loader.py", line 87, in iter
super().iter()
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\utilities\combined_loader.py", line 40, in iter
self.iterators = [iter(iterable) for iterable in self.iterables]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\utilities\combined_loader.py", line 40, in
self.iterators = [iter(iterable) for iterable in self.iterables]
^^^^^^^^^^^^^^
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\torch\utils\data\dataloader.py", line 434, in iter
self._iterator = self._get_iterator()
^^^^^^^^^^^^^^^^^^^^
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\torch\utils\data\dataloader.py", line 387, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\torch\utils\data\dataloader.py", line 1040, in init
w.start()
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\multiprocessing\context.py", line 336, in _Popen
return Popen(process_obj)
^^^^^^^^^^^^^^^^^^
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\multiprocessing\popen_spawn_win32.py", line 95, in init
reduction.dump(process_obj, to_child)
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function get_custom_metadata at 0x00000216EE5ADDA0>: import of module 'metadata_module' failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "G:\Programms\stable audio\stable-audio-tools\train.py", line 125, in
main()
File "G:\Programms\stable audio\stable-audio-tools\train.py", line 120, in main
trainer.fit(training_wrapper, train_dl,
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 545, in fit
call._call_and_handle_interrupt(
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\trainer\call.py", line 68, in _call_and_handle_interrupt
trainer._teardown()
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1017, in _teardown
loop.teardown()
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 407, in teardown
self._data_fetcher.teardown()
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\loops\fetchers.py", line 75, in teardown
self.reset()
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\loops\fetchers.py", line 134, in reset
super().reset()
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\loops\fetchers.py", line 71, in reset
self.length = sized_len(self.combined_loader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\lightning_fabric\utilities\data.py", line 51, in sized_len
length = len(dataloader) # type: ignore [arg-type]
^^^^^^^^^^^^^^^
File "g:\Programms\stable audio\stable-audio-tools.conda\Lib\site-packages\pytorch_lightning\utilities\combined_loader.py", line 342, in len
raise RuntimeError("Please call iter(combined_loader) first.")
RuntimeError: Please call iter(combined_loader) first.

Any idea how to fix it?

My custom_metadata.py looks like this:

import os

def get_custom_metadata(info, audio):
    # The path to the audio file
    audio_path = info["relpath"]

    # The path to the corresponding text file
    text_path = os.path.splitext(audio_path)[0] + '.txt'

    # Read the text from the text file
    with open(text_path, 'r') as f:
        text = f.read()

    # Return the text as the "prompt"
    return {"prompt": text}

my dataset config:
{ "dataset_type": "audio_dir", "datasets": [ { "id": "my_audio", "path": "./trainingdata", "random_crop": true } ], "custom_metadata_module": "./custom_metadata.py" }

and my model config:

{ "model_type": "diffusion_cond", "sample_size": 4194304, "sample_rate": 44100, "audio_channels": 2, "model": { "pretransform": { "type": "autoencoder", "iterate_batch": true, "config": { "encoder": { "type": "dac", "config": { "in_channels": 2, "latent_dim": 128, "d_model": 128, "strides": [4, 4, 8, 8] } }, "decoder": { "type": "dac", "config": { "out_channels": 2, "latent_dim": 64, "channels": 1536, "rates": [8, 8, 4, 4] } }, "bottleneck": { "type": "vae" }, "latent_dim": 64, "downsampling_ratio": 1024, "io_channels": 2 } }, "conditioning": { "configs": [ { "id": "prompt", "type": "clap_text", "config": { "audio_model_type": "HTSAT-base", "enable_fusion": true, "clap_ckpt_path": "./clapmodel/music_audioset_epoch_15_esc_90.14.pt", "use_text_features": true, "feature_layer_ix": -2 } }, { "id": "seconds_start", "type": "int", "config": { "min_val": 0, "max_val": 512 } }, { "id": "seconds_total", "type": "int", "config": { "min_val": 0, "max_val": 512 } } ], "cond_dim": 768 }, "diffusion": { "type": "adp_cfg_1d", "cross_attention_cond_ids": ["prompt", "seconds_start", "seconds_total"], "config": { "in_channels": 64, "context_embedding_features": 768, "context_embedding_max_length": 79, "channels": 256, "resnet_groups": 16, "kernel_multiplier_downsample": 2, "multipliers": [4, 4, 4, 5, 5], "factors": [1, 2, 2, 4], "num_blocks": [2, 2, 2, 2], "attentions": [1, 3, 3, 3, 3], "attention_heads": 16, "attention_multiplier": 4, "use_nearest_upsample": false, "use_skip_scale": true, "use_context_time": true } }, "io_channels": 64 }, "training": { "learning_rate": 4e-5, "demo": { "demo_every": 2000, "demo_steps": 250, "num_demos": 4, "demo_cond": [ {"prompt": "80s style Whipcrack snare drum loop, 120BPM, retro, funk, energetic, nostalgia, dance, disco", "seconds_start": 0, "seconds_total": 30}, {"prompt": "Dubstep style drum loop 5, 110BPM, fast, energy, meditative, spiritual, zen, calming, focus, introspection", "seconds_start": 0, "seconds_total": 30}, {"prompt": "Dance Pop style Synth Chorus loop, fast, energy, fun, lively, upbeat, catchy, indie, fresh, vibrant", "seconds_start": 0, "seconds_total": 30}, {"prompt": "Hip hop style piano loop, 150BPM, Key A, joyful, playful, funny, upbeat, lively, cheerful, entertaining", "seconds_start": 0, "seconds_total": 30} ], "demo_cfg_scales": [3, 6, 9] } } }

Using Win 11.

Thanks! :)

Conditioned diffusion model training demo results in noise

Hi, thank you so much for open-sourcing this amazing repo!!

I was trying to run a simple training of the conditioned diffusion model with 1710 three-minute audio tracks, but I noticed that even after 100 epochs the demo output is still full of noise. I have tried tweaking some of the parameters in the training configs. Based on your provided information, I trained an autoencoder to use as a pretransform and provided its checkpoint during training. But nothing helps. Could you kindly take a look at the configs I used and let me know if I missed anything or where it could go wrong?

Thank you so much for your time and help!!

(Attached image: demo mel spectrogram, media_images_demo_melspec_left_cfg_1 0_2001_0d56e82c8de45944915b)

unwrap_autoencoder: 
{
    "model_type": "autoencoder",
    "sample_rate": 44100,
    "audio_channels": 1,
    "model": {
        "encoder": {
            "type": "dac",
            "config": {
                "in_channels": 1,
                "latent_dim": 256,
                "d_model": 128, 
                "strides": [4, 8, 8, 8]
            }
        },
        "decoder": {
            "type": "dac",
            "config": {
                "out_channels": 1,
                "latent_dim": 128, 
                "channels": 1536,
                "rates": [8, 8, 8, 4]
            }
        },
        "bottleneck": {
            "type": "vae"
        
        },
        "latent_dim": 128, 
        "downsampling_ratio": 2048,
        "io_channels": 1
    },
    "training": {
        "learning_rate": 1e-4,
        "warmup_steps": 0,
        "use_ema": false,
        "loss_configs": {
            "discriminator": {
                "type": "encodec",
                "config":{
                    "filters": 32,
                    "n_ffts": [2048, 1024, 512, 256, 128, 64, 32],
                    "hop_lengths": [512, 256, 128, 64, 32, 16, 8],
                    "win_lengths": [2048, 1024, 512, 256, 128, 64, 32]
                },
                "weights": {
                    "adversarial": 0.1,
                    "feature_matching": 5.0
                }
            },
            "spectral": {
                "type": "mrstft",
                "config": {
                    "fft_sizes": [2048, 1024, 512, 256, 128, 64, 32],
                    "hop_sizes": [512, 256, 128, 64, 32, 16, 8],
                    "win_lengths": [2048, 1024, 512, 256, 128, 64, 32],
                    "perceptual_weighting": true
                },
                "weights": {
                    "mrstft": 1.0
                }
            },
            "time": {
                "type": "l1",
                "weights": {
                    "l1": 0.0
                }
            }
        },
        "demo": {
            "demo_every": 2000
        }
}
}

training_config:
{
    "model_type": "diffusion_cond",
    "sample_size": 2097152,
    "sample_rate": 44100,
    "audio_channels": 1,
    "model": {
        "io_channels": 128,
        "diffusion":{
            "type": "adp_cfg_1d",
            "config": {
                "channels": 128,
                "context_embedding_max_length": 100, 
                "context_embedding_features": 768,
                "in_channels": 128,  
                "multipliers": [1, 2, 4, 8, 16],  
                "factors": [1, 2, 2, 2],  
                "num_blocks": [1, 2, 2, 2],  
                "attentions": [0, 1, 1, 1, 1],
                "attention_heads": 8, 
                "attention_multiplier": 2
        }},        
        "pretransform": {
            "type": "autoencoder",
            "config": {
                "encoder": {
                    "type": "dac",
                    "config": {
                        "in_channels": 1,
                        "latent_dim": 256,
                        "d_model": 128, 
                        "strides": [4, 8, 8, 8]
                    }
                },
                "decoder": {
                    "type": "dac",
                    "config": {
                        "out_channels": 1,
                        "latent_dim": 128, 
                        "channels": 1536,
                        "rates": [8, 8, 8, 4]
                    }
                },
                "bottleneck": {
                    "type": "vae"
                
                },
                "latent_dim": 128, 
                "downsampling_ratio": 2048,
                "io_channels": 1
            }
        },
        "conditioning": {
            "configs": [
                {
                    "id": "prompt",
                    "output_dim":768,
                    "type": "t5",
                    "config": {
                        "t5_model_name": "t5-base",
                        "max_length": 100,
                        "project_out": true
                    }
                },
                {
                    "id": "seconds_start",
                    "type": "int",
                    "config": {
                        "min_val": 0,
                        "max_val": 512
                    }
                },
                {
                    "id": "seconds_total",
                    "type": "number",
                    "config": {
                        "min_val": 0,
                        "max_val": 512
                    }
                }
                    
            ],
            "cond_dim": 768
        }
    },
    "training": {
        "learning_rate": 1e-4,
        "demo": {
            "demo_every": 2000,
            "demo_steps": 250,
            "num_demos": 1,
            "demo_cfg_scales": [1.0],
            "demo_cond":
                [{"prompt":"Mix of guitar, piano, bass, drums",
                  "seconds_start": 1,
                  "seconds_total": 600.0}]
            
        }
    }
}

Custom metadata function:

import yaml

def get_custom_metadata(info, audio):
    metadata_file = info['path'].replace("mix.flac", "metadata.yaml")
    # print(f"metadata path exist? : {os.path.exists(metadata_file)}")
    with open(metadata_file) as f:
        metadata = yaml.safe_load(f)
    # print("Metadata function called!")
    instruments = [stem['inst_class'] for stem in metadata['stems'].values()]
    # print(f'getting instrument metadata: {instruments}')
    custom_metadata = {
        # 'instruments': [stem['inst_class'] for stem in metadata['stems'].values()],
        "prompt": 'Mix of ' + ', '.join(set(instruments))
    }
    # print(f"custom_metadata: {custom_metadata}")
    return custom_metadata

Pickling Error with custom_metadata_module in DataLoader

Description

When running the train.py script to train an audio model, I encounter a pickling error related to the get_custom_metadata function specified in the local_training_example.json configuration. The error seems to indicate an issue with importing the module metadata_module during the multiprocessing of PyTorch's DataLoader.

Steps to Reproduce

Run the train.py script with the provided dataset and model configuration JSON files.
Encounter the _pickle.PicklingError during the script execution.

Expected Behavior

The DataLoader should be able to use the get_custom_metadata function from the metadata_module without any pickling errors.

Actual Behavior

The script throws a _pickle.PicklingError, indicating the function get_custom_metadata cannot be pickled.

Environment

Operating System: Windows 10
Python Version: 3.9.13
PyTorch Version: 2.1.1 (with CUDA 12.1 support)
CUDA Version: 12.1
cuDNN Version: 8801

Configuration Files

local_training_example.json:

{
  "dataset_type": "audio_dir",
  "datasets": [
    {
      "id": "nsynth_train",
      "path": "E:/nsynth/nsynth-train/audio/"
    },
    {
      "id": "nsynth_valid",
      "path": "E:/nsynth/nsynth-valid/audio/"
    },
    {
      "id": "nsynth_test",
      "path": "E:/nsynth/nsynth-test/audio/"
    }
  ],
  "random_crop": true,
  "custom_metadata_module": "./stable_audio_tools/configs/dataset_configs/custom_metadata/custom_md_example.py"
}

custom_md_example.py:

def get_custom_metadata(info, audio):
  # Use relative path as the prompt
  return {"prompt": info["relpath"]}

console output:

(venv) PS C:...\stable-audio-tools-main> python ./train.py --dataset-config ./stable_audio_tools/configs/dataset_configs/local_training_example.json --model-config ./stable_audio_tools/configs/model_configs/autoencoders/stable_audio_1_0_vae.json --name dedai
Found 305979 files
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
C:...\venv\lib\site-packages\torch\nn\utils\weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
...
EOFError: Ran out of input
_pickle.PicklingError: Can't pickle <function get_custom_metadata at 0x0000014C62C10820>: import of module 'metadata_module' failed
...

Training from scratch on RTX 3090 doable?

Hi, I was wondering if this is a small model that is suitable for personal training, or is this release more intended for academics with huge funding? If the latter, then I assume fine-tuning at least will be doable?

Cheers!

training in Colab A100 vs V100

Hi, I am trying to train using google colab, it works fine with a V100, but if I get an A100 I cannot run train.py

I get this output (the exact same settings work fine for the V100, and I tried using 16-mixed and bf16-mixed):

You are using a CUDA device ('NVIDIA A100-SXM4-40GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type                  | Params
--------------------------------------------------------
0 | diffusion     | DiffusionModelWrapper | 221 M 
1 | diffusion_ema | EMA                   | 442 M 
2 | losses        | MultiLoss             | 0     
--------------------------------------------------------
221 M     Trainable params
221 M     Non-trainable params
442 M     Total params
1,771.115 Total estimated model params size (MB)
Epoch 0:   0% 0/128 [00:00<?, ?it/s] /content/stable-audio-tools/stable_audio_tools/models/blocks.py:68: UserWarning: Memory efficient kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:415.)
  y = F.scaled_dot_product_attention(q, k, v, is_causal=False).contiguous().view([n, c, s])
/content/stable-audio-tools/stable_audio_tools/models/blocks.py:68: UserWarning: Memory Efficient attention has been runtime disabled. (Triggered internally at ../aten/src/ATen/native/transformers/sdp_utils_cpp.h:456.)
  y = F.scaled_dot_product_attention(q, k, v, is_causal=False).contiguous().view([n, c, s])
/content/stable-audio-tools/stable_audio_tools/models/blocks.py:68: UserWarning: Flash attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:417.)
  y = F.scaled_dot_product_attention(q, k, v, is_causal=False).contiguous().view([n, c, s])
/content/stable-audio-tools/stable_audio_tools/models/blocks.py:68: UserWarning: Both fused kernels require the last dimension of the input to have stride 1. Got Query.stride(-1): 256, Key.stride(-1): 256, Value.stride(-1): 256 instead. (Triggered internally at ../aten/src/ATen/native/transformers/sdp_utils_cpp.h:423.)
  y = F.scaled_dot_product_attention(q, k, v, is_causal=False).contiguous().view([n, c, s])
RuntimeError: No available kernel. Aborting execution.

Any clues on how to solve this?

Possibility of generating songs like Suno

Great work! I'd also like to ask if you've tried using lyrics as input to generate the corresponding singing, similar to the Suno approach. Does the DiT structure support this form of generation?

Fail to train dac with rvq model in multi-gpu

Hi,

I tried to train the DAC VAE, but the loss becomes NaN at epoch 3. I then tried to train the DAC RVQ, but I can only train it on a single GPU; multi-GPU training fails.

I am training the DAC VAE because MusicGen model training requires such a compression component, but there is no official config for the DAC RVQ.

Thanks for your kind help.

The detailed log of the DAC RVQ run is:

Epoch 0:   0%| | 2/48248 [00:15<106:04:35,  7.92s/it, v_num=q5e6, train/loss=8.820, train/mrstft_loss=7.590, train/l1_time_loss=0.000, train/loss_adv=-.00369, train/feature_matching=0.145, train/latenRuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 6 13 20 27 34 41 48 55
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/torchaudio/transforms/_transforms.py:580: UserWarning: Argument 'onesided' has been deprecated and has no influence on the behavior of this module.
  warnings.warn(
/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/torchaudio/functional/functional.py:584: UserWarning: At least one mel filterbank has all zero values. The value for `n_mels` (128) may be set too high. Or, the value for `n_freqs` (513) may be set too low.
  warnings.warn(
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 6 13 20 27 34 41 48 55
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 3: 6 13 20 27 34 41 48 55
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Traceback (most recent call last):
  File "/home/test/code/audio/test/stable-audio-tools/egs/RVQ_dac/src/train.py", line 103, in <module>
    main()
  File "/home/test/code/audio/test/stable-audio-tools/egs/RVQ_dac/src/train.py", line 100, in main
    trainer.fit(training_wrapper, train_dl, ckpt_path=args.ckpt_path if args.ckpt_path else None)
  File "/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
    results = self._run_stage()
  File "/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1023, in _run_stage
    self.fit_loop.run()
  File "/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 355, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 133, in run
    self.advance(data_fetcher)
  File "/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 221, in advance
    batch_output = self.manual_optimization.run(kwargs)
  File "/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/manual.py", line 91, in run
    self.advance(kwargs)
  File "/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/manual.py", line 111, in advance
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 294, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 330, in training_step
    return self.model(*args, **kwargs)
  File "/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1515, in forward
    inputs, kwargs = self._pre_forward(*inputs, **kwargs)
  File "/home/test/python_env/anaconda3/envs/StableAudio/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1409, in _pre_forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)`.

The detailed config of the DAC RVQ is:

"model_type": "autoencoder",
    "sample_size": 16384,
    "sample_rate": 48000,
    "audio_channels": 1,
    "model": {
        "encoder": {
            "type": "dac",
            "config": {
                "latent_dim": 32,
                "d_model": 128,
                "strides": [4, 4, 8, 8]
            }
        },
        "decoder": {
            "type": "dac",
            "config": {
                "latent_dim": 32,
                "channels": 1024,
                "rates": [8, 4, 8, 4]
            }
        },
        "bottleneck": {
            "type": "dac_rvq",
            "config": {
                "input_dim": 32,
                "n_codebooks": 8,
                "codebook_dim": 32,
                "codebook_size": 1024,
                "quantizer_dropout": 0.0
            }
        },
        "latent_dim": 32,
        "downsampling_ratio": 1024,
        "io_channels": 1
    },
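
Not part of the original report, but the RuntimeError above already points at one workaround: enabling unused-parameter detection in the DDP strategy. A minimal sketch with PyTorch Lightning (the Trainer arguments here are placeholders, not the exact ones train.py builds):

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Let DDP tolerate parameters that receive no gradient in a given step
# (e.g. quantizer levels that did not contribute to the loss), as the error
# message recommends. The string form strategy="ddp_find_unused_parameters_true"
# is equivalent.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy=DDPStrategy(find_unused_parameters=True),
    # ... remaining arguments as set up by train.py
)
```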

UNet1DUncondWrapper Initialization error due to missing in_channels argument

Description

I encountered an issue where the diffusion config does not appear to be passed correctly to UNet1DUncondWrapper from create_diffusion_uncond_from_config.

Despite specifying in_channels in the diffusion configuration, the initialization of UNet1DUncondWrapper fails with a missing-argument error.

Steps to Reproduce

  1. Use the following model configuration:

    {
      "model_type": "diffusion_uncond",
      "sample_size": 524288,
      "sample_rate": 44100,
      "audio_channels": 1,
      "model": {
        "type": "adp_uncond_1d",
        "diffusion": {
          "config": {
            "in_channels": 64
          }
        },
        "pretransform": {
          "type": "dac_pretrained",
          "config": {}
        },
        "training": {
          "learning_rate": 0.00004,
          "demo": {
            "demo_every": 1500,
            "demo_steps": 100,
            "num_demos": 1
          }
        }
      }
    }
  2. Pass this config to train.py (this configuration is incomplete, but is offered as a minimal example because it reproduces the error reliably).

Expected Behavior

The model initialises and the in_channels parameter is set from the config.

Actual Behavior

The model initialization fails with the following error message:

```plaintext
File "/content/stable-audio-tools/stable_audio_tools/models/diffusion.py", line 568, in create_diffusion_uncond_from_config
    model = UNet1DUncondWrapper(
TypeError: __init__() missing 1 required positional argument: 'in_channels'
```

It seems that the diffusion.config block may not be properly passed to UNet1DUncondWrapper. Despite in_channels being explicitly defined and set, the value is not recognised.

good model_config and train_config example

Does anybody have a 'good' model and training configuration .json to share? I am looking for some examples for experimenting.
Mine right now looks like:
"
`{
"model_type": "autoencoder",
"sample_size": 32768,
"sample_rate": 44100,
"audio_channels": 1,
"model": {
"encoder": {
"type": "dac",
"config": {
"latent_dim": 64,
"d_model": 32,
"strides": [4, 4, 8, 8]
}
},
"decoder": {
"type": "dac",
"config": {
"latent_dim": 64,
"channels": 512,
"rates": [8, 4, 8, 4]
}
},
"bottleneck": {
"type": "vae"
},
"latent_dim": 64,
"downsampling_ratio": 1024,
"io_channels": 1
},
"training": {
"learning_rate": 1e-4,
"warmup_steps": 0,
"use_ema": false,
"loss_configs": {
"discriminator": {
"type": "encodec",
"config": {
"filters": 32,
"n_ffts": [2048, 1024, 512, 256, 128, 64, 32],
"hop_lengths": [512, 256, 128, 64, 32, 16, 8],
"win_lengths": [2048, 1024, 512, 256, 128, 64, 32]
},
"weights": {
"adversarial": 0.1,
"feature_matching": 5.0
}
},
"spectral": {
"type": "mrstft",
"config": {
"fft_sizes": [2048, 1024, 512, 256, 128, 64, 32],
"hop_sizes": [512, 256, 128, 64, 32, 16, 8],
"win_lengths": [2048, 1024, 512, 256, 128, 64, 32],
"perceptual_weighting": true
},
"weights": {
"mrstft": 1.0
}
},
"time": {
"type": "l1",
"weights": {
"l1": 0.0
}
}
},
"demo": {
"demo_every": 4000
}
}
}
"

and:
{
    "dataset_type": "audio_dir",
    "datasets": [
        {
            "id": "my_audio",
            "path": "./trainingdata",
            "random_crop": true
        }
    ]
}

I also want to do training with my own prompts and need to figure out how to work with the training data correctly. I want to use the title of each audio file as its matching prompt. So, if you have any tips...
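
For the prompt-from-filename part, a minimal sketch of a custom metadata module, assuming the dataset config points its "custom_metadata_module" key at this file and that the loader passes the file's relative path in info["relpath"]:

```python
import os

def get_custom_metadata(info, audio):
    # Use the file name (minus extension) as the text prompt, turning
    # underscores into spaces so titles read as natural prompts.
    title = os.path.splitext(os.path.basename(info["relpath"]))[0]
    return {"prompt": title.replace("_", " ")}
```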

Yeah, the documentation is hard for people who aren't quite experienced.

Thanks :)

ModuleNotFoundError: No module named 'pathtools.patterns'

Installed according to the setup guide by doing these steps:

  1. git clone https://github.com/Stability-AI/stable-audio-tools
  2. create and activate conda environment with Python 3.8.10
  3. "pip install stable-audio-tools"
  4. "pip install ."

When I run "python run_gradio.py --pretrained-name stabilityai/stable-audio-open-1.0" I get the following error:

Traceback (most recent call last):
  File "C:\Users\xxx\miniconda3\envs\stable-audio-tools_env\lib\site-packages\wandb\vendor\watchdog_0_9_0\wandb_watchdog\observers\__init__.py", line 90, in <module>
    from .read_directory_changes import WindowsApiObserver as Observer
  File "C:\Users\xxx\miniconda3\envs\stable-audio-tools_env\lib\site-packages\wandb\vendor\watchdog_0_9_0\wandb_watchdog\observers\read_directory_changes.py", line 26, in <module>
    from wandb_watchdog.events import (
  File "C:\Users\xxx\miniconda3\envs\stable-audio-tools_env\lib\site-packages\wandb\vendor\watchdog_0_9_0\wandb_watchdog\events.py", line 91, in <module>
    from pathtools.patterns import match_any_paths
ModuleNotFoundError: No module named 'pathtools.patterns'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_gradio.py", line 2, in <module>
    from stable_audio_tools.interface.gradio import create_ui
  File "C:\Users\xxx\Documents\machine_learning\audio_stable-audio-tools\stable_audio_tools\interface\gradio.py", line 8, in <module>
    from aeiou.viz import audio_spectrogram_image
  File "C:\Users\xxx\miniconda3\envs\stable-audio-tools_env\lib\site-packages\aeiou\viz.py", line 29, in <module>
    import wandb
  File "C:\Users\xxx\miniconda3\envs\stable-audio-tools_env\lib\site-packages\wandb\__init__.py", line 26, in <module>
    from wandb import sdk as wandb_sdk
  File "C:\Users\xxx\miniconda3\envs\stable-audio-tools_env\lib\site-packages\wandb\sdk\__init__.py", line 8, in <module>
    from .wandb_init import _attach, init  # noqa: F401
  File "C:\Users\xxx\miniconda3\envs\stable-audio-tools_env\lib\site-packages\wandb\sdk\wandb_init.py", line 32, in <module>
    from .backend.backend import Backend
  File "C:\Users\xxx\miniconda3\envs\stable-audio-tools_env\lib\site-packages\wandb\sdk\backend\backend.py", line 20, in <module>
    from ..internal.internal import wandb_internal
  File "C:\Users\xxx\miniconda3\envs\stable-audio-tools_env\lib\site-packages\wandb\sdk\internal\internal.py", line 33, in <module>
    from . import context, handler, internal_util, sender, settings_static, writer
  File "C:\Users\xxx\miniconda3\envs\stable-audio-tools_env\lib\site-packages\wandb\sdk\internal\sender.py", line 31, in <module>
    from wandb.filesync.dir_watcher import DirWatcher
  File "C:\Users\xxx\miniconda3\envs\stable-audio-tools_env\lib\site-packages\wandb\filesync\dir_watcher.py", line 22, in <module>
    wd_polling = util.vendor_import("wandb_watchdog.observers.polling")
  File "C:\Users\xxx\miniconda3\envs\stable-audio-tools_env\lib\site-packages\wandb\util.py", line 178, in vendor_import
    module = import_module(name)
  File "C:\Users\xxx\miniconda3\envs\stable-audio-tools_env\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "C:\Users\xxx\miniconda3\envs\stable-audio-tools_env\lib\site-packages\wandb\vendor\watchdog_0_9_0\wandb_watchdog\observers\__init__.py", line 92, in <module>
    from .polling import PollingObserver as Observer
  File "C:\Users\xxx\miniconda3\envs\stable-audio-tools_env\lib\site-packages\wandb\vendor\watchdog_0_9_0\wandb_watchdog\observers\polling.py", line 50, in <module>
    from wandb_watchdog.events import (
  File "C:\Users\xxx\miniconda3\envs\stable-audio-tools_env\lib\site-packages\wandb\vendor\watchdog_0_9_0\wandb_watchdog\events.py", line 91, in <module>
    from pathtools.patterns import match_any_paths
ModuleNotFoundError: No module named 'pathtools.patterns'

I tried installing pathtools manually with "pip install pathtools" but it is already installed. Anyone figured this out?

Setup:
Win 11
RTX 3090

dependencies problem

Hi there,

I've been attempting to install the stable-audio-tools package locally by cloning the repository, due to some pip installation dependency issues. However, in both cases, I've encountered a specific problem related to the pedalboard dependency version. The installation process is halted with the following error messages:

ERROR: Could not find a version that satisfies the requirement pedalboard==0.7.4 (from stable-audio-tools) (from versions: 0.8.2, 0.8.3, 0.8.4, 0.8.5, 0.8.6, 0.8.7, 0.8.8, 0.8.9)
ERROR: No matching distribution found for pedalboard==0.7.4

It seems that stable-audio-tools requires pedalboard==0.7.4, but only newer versions of pedalboard are available.
I'm keen to hear if there's a workaround or if an update to accommodate newer pedalboard versions is possible.

Thank you in advance

CUDA not recognized, also -1 seed not working.

OS: Windows 10
GPU: Nvidia RTX 3060

My issue is that I cannot use the default -1 seed, and CUDA is not detected for some reason.

ValueError: high is out of bounds for int32

I have to manually input any seed that is not -1.
And when I run this with a random seed I manually input, I get this error:

stable-audio-tools\venv\lib\site-packages\torch\amp\autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling

How can I make this recognize my CUDA GPU?
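
Not from the original report, but a small diagnostic sketch that shows whether a CPU-only PyTorch wheel is installed, which is the usual cause of the autocast warning above:

```python
import torch

# If CUDA is unavailable and torch.version.cuda is None, the installed wheel
# was built without CUDA support and a CUDA-enabled build is needed instead.
print("cuda available:", torch.cuda.is_available())
print("torch built with CUDA:", torch.version.cuda)
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```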

Failing to train a simple model; is there any tuned config and corpus for beginners?

Hi,

I am a beginner who has just started working on this project. I tried training an audio diffusion model on 1000 audio samples, but even after several hundred epochs the audio generated in the demos is just strange noise. I also attempted to use a DAE model with the same training method, but the model did not converge and the generated audio was also strange.

Can you kindly provide some basic open-source datasets, training scripts, and model configurations that can quickly train a small model for experimental verification? This would be very helpful for us beginners.

Thank you very much. Below is my experimental configuration.

1.  dac
{
    "model_type": "autoencoder",
    "sample_size": 32768,
    "sample_rate": 44100,
    "audio_channels": 1,
    "model": {
        "encoder": {
            "type": "dac",
            "config": {
                "latent_dim": 64,
                "d_model": 32,
                "strides": [4, 4, 8, 8]
            }
        },
        "decoder": {
            "type": "dac",
            "config": {
                "latent_dim": 32,
                "channels": 512,
                "rates": [8, 4, 8, 4]
            }
        },
        "bottleneck": {
            "type": "vae"
        },
        "latent_dim": 32,
        "downsampling_ratio": 1024,
        "io_channels": 1
    },
    "training": {
        "learning_rate": 1e-4,
        "warmup_steps": 0,
        "use_ema": false,
        "loss_configs": {
            "discriminator": {
                "type": "encodec",
                "config": {
                    "filters": 32,
                    "n_ffts": [2048, 1024, 512, 256, 128, 64, 32],
                    "hop_lengths": [512, 256, 128, 64, 32, 16, 8],
                    "win_lengths": [2048, 1024, 512, 256, 128, 64, 32]
                },
                "weights": {
                    "adversarial": 0.1,
                    "feature_matching": 5.0
                }
            },
            "spectral": {
                "type": "mrstft",
                "config": {
                    "fft_sizes": [2048, 1024, 512, 256, 128, 64, 32],
                    "hop_sizes": [512, 256, 128, 64, 32, 16, 8],
                    "win_lengths": [2048, 1024, 512, 256, 128, 64, 32],
                    "perceptual_weighting": true
                },
                "weights": {
                    "mrstft": 1.0
                }
            },
            "time": {
                "type": "l1",
                "weights": {
                    "l1": 0.0
                }
            }
        },
        "demo": {
            "demo_every": 2000
        }
    }
}

2. dancedm
{
    "model_type": "diffusion_uncond",
    "sample_size": 65536,
    "sample_rate": 48000,
    "model": {
        "type": "DAU1d",
        "config": {
            "n_attn_layers": 4,
            "depth": 14,
            "channels": [32, 32, 64, 64, 128, 128, 128, 128, 128, 128, 128, 128, 256, 256]
        }
    },
    "training": {
        "learning_rate": 1e-4,
        "demo": {
            "demo_every": 2000,
            "demo_steps": 250
        }
    }
}

The generated audio file (image attached in the original issue).

Windows gradio UI issue - ImportError: attempted relative import with no known parent package

I change into the stable-audio-tools\stable_audio_tools\interface\ directory and run python gradio.py, which gives this error:

D:\Tests\Stable Audio Tools\stable-audio-tools\stable_audio_tools\interface>python gradio.py
Traceback (most recent call last):
  File "D:\Tests\Stable Audio Tools\stable-audio-tools\stable_audio_tools\interface\gradio.py", line 3, in <module>
    import gradio as gr
  File "D:\Tests\Stable Audio Tools\stable-audio-tools\stable_audio_tools\interface\gradio.py", line 14, in <module>
    from ..inference.generation import generate_diffusion_cond, generate_diffusion_uncond
ImportError: attempted relative import with no known parent package

Does something with the .. reference not work on Windows?
The inference folder stable-audio-tools\stable_audio_tools\inference does exist.

missing musicgen

Hi,
This is a good job; thank you for your open-source work.
The model types include musicgen, but the model and training files are missing the musicgen module. Can the files be completed?

adding comments to train.py to make it explicit for beginners

Hello!
I'm a beginner in terms of building and training models. I thought it might be useful to add # comments to train.py explaining what each component of the code does, for easier understanding, especially for people starting out. Here's an example (screenshot in the original issue).

If useful, I can do a PR.

Thank you!

Request: Training config repository

Hello,

Thanks for open sourcing this framework!

I think it could be quite valuable to the community to create further config repositories, with further documentation.

For example, for someone who would like to train a DAC VAE in stereo, it could be valuable to understand how to modify the existing config, especially for beginners. Maybe a discussion section would be a great addition, too, to share results and insights.

Best,
M

`Trainer.fit` stopped: No training batches.

Just a quick question regarding getting a model trained.

I have worked through the dependency issues and finally have it running. However, when using the sample musicgen_small_finetune.json configuration and a dataset configuration that points to a local folder, I cannot seem to get it to begin training.

I end up simply getting the following:

Command:

python3 ./train.py --dataset-config ./dataset.conf --model-config ./model.conf --name maybe --strategy auto --save-dir ./save --num-gpus 1
wandb: Currently logged in as: user. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.15.12 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.15.4
wandb: Run data is saved locally in ./wandb/run-20231013_235207-9n7ur241
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run hearty-oath-10
wandb: ⭐️ View project at https://wandb.ai/user/maybe
wandb: 🚀 View run at https://wandb.ai/user/maybe/runs/9n7ur241
wandb: logging graph, to disable use `wandb.watch(log_graph=False)`
/home/server/miniconda3/envs/stableaudio/lib/python3.10/site-packages/lightning_fabric/connector.py:554: UserWarning: 16 is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
  rank_zero_warn(
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA GeForce RTX 4090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name        | Type                          | Params
--------------------------------------------------------------
0 | lm          | LMModel                       | 420 M 
1 | lm_ema      | EMA                           | 840 M 
2 | cfg_dropout | ClassifierFreeGuidanceDropout | 0     
--------------------------------------------------------------
420 M     Trainable params
420 M     Non-trainable params
840 M     Total params
3,362.972 Total estimated model params size (MB)
/home/server/miniconda3/envs/stableaudio/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py:103: UserWarning: Total length of `CombinedLoader` across ranks is zero. Please make sure this was your intention.
  rank_zero_warn(
`Trainer.fit` stopped: No training batches.
wandb: Waiting for W&B process to finish... (success).
wandb: 🚀 View run hearty-oath-10 at: https://wandb.ai/user/maybe/runs/9n7ur241
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20231013_235207-9n7ur241/logs

Am I missing something within my configs?
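
As a hedged sanity check (not an official diagnostic), it can help to confirm that the folder the dataset config points to actually contains audio files the loader can pick up, since an empty dataloader produces exactly this "No training batches" stop. The path and extensions below are placeholders:

```python
import pathlib

# Adjust the path to match the dataset config; this only verifies that audio
# files exist where the config points.
root = pathlib.Path("./trainingdata")
audio_files = [p for p in root.rglob("*") if p.suffix.lower() in {".wav", ".flac", ".mp3", ".ogg"}]
print(f"found {len(audio_files)} audio files under {root.resolve()}")
```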

CUDA out of memory training oobleck?

I've been very naively trying to piece together a training config for Oobleck, based partly on the autoencoders doc file and the dac_2048_32_vae.json example config. This is what I have now (which is very likely to be wrong!):


{
    "model_type": "autoencoder",
    "sample_size": 262144,
    "sample_rate": 48000,
    "audio_channels": 2,
    "model": {
        "encoder": {
            "type": "oobleck",
            "config": {
                "in_channels": 2,
                "channels": 128,
                "c_mults": [1, 2, 4, 8],
                "strides": [2, 4, 8, 8],
                "latent_dim": 128,
                "use_snake": true
            }
        },
        "decoder": {
            "type": "oobleck",
            "config": {
                "out_channels": 2,
                "channels": 128,
                "c_mults": [1, 2, 4, 8],
                "strides": [2, 4, 8, 8],
                "latent_dim": 64,
                "use_snake": true,
                "use_nearest_upsample": false
            }
        },
        "bottleneck": {
            "type": "l2_norm"
        },
        "latent_dim": 32,
        "downsampling_ratio": 2048,
        "io_channels": 1
    },
    "training": {
        "learning_rate": 1e-4,
        "warmup_steps": 0,
        "use_ema": false,
        "batch_size": 2,
        "loss_configs": {
            "discriminator": {
                "type": "encodec",
                "config": {
                    "filters": 32,
                    "n_ffts": [2048, 1024, 512, 256, 128, 64, 32],
                    "hop_lengths": [512, 256, 128, 64, 32, 16, 8],
                    "win_lengths": [2048, 1024, 512, 256, 128, 64, 32]
                },
                "weights": {
                    "adversarial": 0.1,
                    "feature_matching": 5.0
                }
            },
            "spectral": {
                "type": "mrstft",
                "config": {
                    "fft_sizes": [2048, 1024, 512, 256, 128, 64, 32],
                    "hop_sizes": [512, 256, 128, 64, 32, 16, 8],
                    "win_lengths": [2048, 1024, 512, 256, 128, 64, 32],
                    "perceptual_weighting": true
                },
                "weights": {
                    "mrstft": 1.0
                }
            },
            "time": {
                "type": "l1",
                "weights": {
                    "l1": 0.0
                }
            }
        },
        "demo": {
            "demo_every": 2000
        }
    }
}

Any suggestions? I think I initially tried removing the discriminator, but hit errors and added it back in.

When using s3_wds_example.json to download the dataset: No such file or directory: 'aws'

I use the following command to train the model:

python ./train.py --dataset-config stable_audio_tools/configs/dataset_configs/s3_wds_example.json --model-config stable_audio_tools/configs/model_configs/txt2audio/stable_audio_1_0.json --name harmonai_train

It tries to download the dataset from AWS S3, but raises the following error:

FileNotFoundError: [Errno 2] No such file or directory: 'aws'

Thanks for any help!

Training issue - ckpt and state_dict don't seem to match

Running into issues on training. It seems to work and finds all my samples, etc., but I get an error when it tries to actually load the ckpt file.

The traceback is below. I cut it off, but it just lists tons of tensors/layers, so somehow they're not matching?

Traceback (most recent call last):
  File "train.py", line 128, in <module>
    main()
  File "train.py", line 63, in main
    model.pretransform.load_state_dict(load_ckpt_state_dict(args.pretransform_ckpt_path))
  File "H:\stable_audio_train\stable-audio-tools\stable_audio_tools\models\pretransforms.py", line 90, in load_state_dict
    self.model.load_state_dict(state_dict, strict=strict)
  File "C:\Users\josh4\anaconda3\envs\mytestenv\lib\site-packages\torch\nn\modules\module.py", line 2189, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for AudioAutoencoder:
    Missing key(s) in state_dict: "encoder.layers.0.bias", "encoder.layers.0.weight_g", "encoder.layers.0.weight_v", "encoder.layers.1.layers.0.layers.0.alpha", "encoder.layers.1.layers.0.layers.0.beta", "encoder.layers.1.layers.0.layers.1.bias", "encoder
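
A hedged way to investigate (not from the original report): inspect the checkpoint's keys and compare them with what the configured model expects, which usually shows whether the checkpoint is still wrapped in a training wrapper or was trained with a different architecture than the config describes. The path below is a placeholder:

```python
import torch

# "pretransform.ckpt" stands in for the file passed via --pretransform-ckpt-path.
ckpt = torch.load("pretransform.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # unwrap Lightning-style checkpoints if present
# Unexpected key prefixes here suggest a wrapped training checkpoint rather
# than an unwrapped model checkpoint.
print(list(state_dict.keys())[:10])
```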

Which CLAP model to use?

Hello,

To use a text-to-audio model, I need a CLAP model. In the example model config, it's called with:

"config": {
    "audio_model_type": "HTSAT-base",
    "enable_fusion": true,
    "clap_ckpt_path": "/path/to/clap.ckpt",
    "use_text_features": true,
    "feature_layer_ix": -2
}
So it's calling "clap.ckpt", but I could only find some .bin or .pt models online and tried both of them. Is this a problem? What model did you use? I am asking because my training.py is not working with this, and I'm trying to figure out if this is the problem.

thanks :)

Loss about vae and diffusion

I am training VAE and Stable Audio 2 models from scratch. What values should the VAE and diffusion losses reach? My current VAE loss is about 0.4, and the diffusion MSE loss is 0.53.

Training own model just ends in noise

Hello,
I have tried multiple times to train my own model with prompts. I used over 1000 audio files (between 10–30 seconds each), but even after 10,000 training steps it produces just the same noise as an untrained model. Anyone have an idea why? I used the standard configs for the most part.

Model config

{

"model_type": "diffusion_cond",
"sample_size": 4194304,
"sample_rate": 44100,
"audio_channels": 2,
"model": {
    "pretransform": {
        "type": "autoencoder",
        "iterate_batch": true,
        "config": {
            "encoder": {
                "type": "dac",
                "config": {
                    "in_channels": 2,
                    "latent_dim": 128,
                    "d_model": 128,
                    "strides": [4, 4, 8, 8]
                }
            },
            "decoder": {
                "type": "dac",
                "config": {
                    "out_channels": 2,
                    "latent_dim": 64,
                    "channels": 1536,
                    "rates": [8, 8, 4, 4]
                }
            },
            "bottleneck": {
                "type": "vae"
            },
            "latent_dim": 64,
            "downsampling_ratio": 1024,
            "io_channels": 2
        }
    },
        "conditioning": {
            "configs": [
                {
                    "id": "prompt",
                    "type": "clap_text",
                    "config": {
                        "audio_model_type": "HTSAT-base",
                        "enable_fusion": true,
                        "clap_ckpt_path": "stable-audio-tools/clapmodel/music_audioset_epoch_15_esc_90.14.pt",
                        "use_text_features": true,
                        "feature_layer_ix": -2
                    }
            },
            {
                "id": "seconds_start",
                "type": "int",
                "config": {
                    "min_val": 0,
                    "max_val": 512
                }
            },
            {
                "id": "seconds_total",
                "type": "int",
                "config": {
                    "min_val": 0,
                    "max_val": 512
                }
            }
        ],
        "cond_dim": 768
    },
    "diffusion": {
        "type": "adp_cfg_1d",
        "cross_attention_cond_ids": ["prompt", "seconds_start", "seconds_total"],
        "config": {
            "in_channels": 64,
            "context_embedding_features": 768,
            "context_embedding_max_length": 79,
            "channels": 256,
            "resnet_groups": 16,
            "kernel_multiplier_downsample": 2,
            "multipliers": [4, 4, 4, 5, 5],
            "factors": [1, 2, 2, 4],
            "num_blocks": [2, 2, 2, 2],
            "attentions": [1, 3, 3, 3, 3],
            "attention_heads": 16,
            "attention_multiplier": 4,
            "use_nearest_upsample": false,
            "use_skip_scale": true,
            "use_context_time": true
        }
    },
    "io_channels": 64
},
"training": {
    "learning_rate": 4e-5,
    "demo": {
        "demo_every": 2500,
        "demo_steps": 250,
        "num_demos": 4,
        "demo_cond": [
            {"prompt": "80s style Whipcrack snare drum loop, 120BPM, retro, funk, energetic, nostalgia, dance, disco", "seconds_start": 0, "seconds_total": 30},
            {"prompt": "Dubstep style drum loop 5, 110BPM, fast, energy, meditative, spiritual, zen, calming, focus, introspection", "seconds_start": 0, "seconds_total": 30},
            {"prompt": "Dance Pop style Synth Chorus loop, fast, energy, fun, lively, upbeat, catchy, indie, fresh, vibrant", "seconds_start": 0, "seconds_total": 30},
            {"prompt": "Hip hop style piano loop, 150BPM, Key A, joyful, playful, funny, upbeat, lively, cheerful, entertaining", "seconds_start": 0, "seconds_total": 30}
        ],
        "demo_cfg_scales": [3, 6, 9]
    }
}

}

As a CLAP model I tried https://huggingface.co/laion/clap-htsat-fused, https://huggingface.co/laion/larger_clap_music/, and https://huggingface.co/hustcw/clap-text/tree/main.

dataset_config:

{
    "dataset_type": "audio_dir",
    "datasets": [
        {
            "id": "my_audio",
            "path": "stable-audio-tools/trainingdata",
            "random_crop": true
        }
    ],
    "custom_metadata_module": "stable-audio-tools/custom_metadata.py"
}

and my custom_metadata.py:

import os

def get_custom_metadata(info, audio):
    # The path to the audio file
    audio_path = info["relpath"]

    # The name of the audio file without its extension
    audio_name = os.path.splitext(os.path.basename(audio_path))[0]

    # Return the audio file's name as the "prompt"
    return {"prompt": audio_name}

Also, the demo sound and every other sound I generated is about 6 minutes and 20 seconds long. Why? It should only be about 30 seconds.

Anyone have an idea? Thanks! :)

[Bug] Segmentation fault trying to generate with `stable-audio-open-1.0`

Hello! Thank you for releasing the audio model.

I downloaded the model locally and ran it with

python run_gradio.py  --model-config /path/to/model_config.json --ckpt-path /path/to/model.ckpt

The Gradio UI starts up; however, when trying to generate, it just segmentation faults.

Running on public URL: <redacted>
This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
Prompt: a hip hop song
1436392480
  0%|                                                                                                                            | 0/100 [00:00<?, ?it/s]
Segmentation fault

Env: conda with latest stable torch cuda 12.1. L4 24GB GPU.

Was anyone able to generate?

Output directory and filename for the gradio UI

Rather than saving each output.wav under a random folder under C:/Users/UserName/AppData/Local/Temp/gradio/, could there be an argument/parameter for the output directory?

python run_gradio.py --ckpt-path ".\ckpt\model.ckpt" --model-config ".\ckpt\model_config.json" --output_directory "d:\blah\"

Name the output wavs based on the prompt. But if you do this, make sure to check that the directory/filename length is less than 256 chars for Windows, and auto-trim the name length if needed. Users love using long prompts.

That would allow all the wavs generated to be in one directory the user specifies and have identifiable names.
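
Not an existing option, just a sketch of the kind of helper such a feature could use, with the Windows path-length trimming mentioned above; the function name and limit are illustrative:

```python
import os
import re

def prompt_to_wav_path(prompt: str, out_dir: str, max_path_len: int = 255) -> str:
    # Strip characters Windows does not allow in filenames and collapse whitespace.
    safe = re.sub(r'[\\/:*?"<>|]+', "_", prompt)
    safe = re.sub(r"\s+", " ", safe).strip()
    # Trim the name so the full path stays under the length limit.
    budget = max_path_len - len(os.path.abspath(out_dir)) - len(os.sep) - len(".wav")
    return os.path.join(out_dir, safe[:max(budget, 1)] + ".wav")

# Example: prompt_to_wav_path("80s style Whipcrack snare drum loop, 120BPM", "D:/stable_audio_out")
```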

Code comments

Hello, as a newcomer to this codebase it would be nice if the code had comments in it explaining what's happening. It's abstracted to oblivion just like the SGM codebase is, where everything is difficult to find. There seems to be little-to-no value to this level of abstraction, but code comments would at least make it less painful.
