
open-musiclm's Introduction

Open MusicLM

PyTorch implementation of MusicLM, a SOTA text-to-music model published by Google, with a few modifications. We use CLAP as a replacement for MuLan, Encodec as a replacement for SoundStream, and MERT as a replacement for w2v-BERT.

(Figures: diagram of MusicLM; diagram of CLAP)

Why CLAP?

CLAP is a joint audio-text model trained on LAION-Audio-630K. Similar to MuLan, it consists of an audio tower and a text tower that project their respective media onto a shared latent space (512 dimensions in CLAP vs 128 dimensions in MuLan).

MuLan was trained on 50 million text-music pairs. Unfortunately I don't have the data to replicate this, so I'm relying on CLAP's pretrained checkpoints to come close. CLAP was trained on 2.6 million total text-audio pairs from LAION-Audio-630K (~633k text-audio pairs) and AudioSet (2 million samples with captions generated by a keyword-to-caption model). Although this is a fraction of the data used to train MuLan, we have successfully used CLAP to generate diverse music samples, which you can listen to here (keep in mind these are very early results). In the event that CLAP's latent space is not expressive enough for music generation, we can train CLAP on music or swap it out for @lucidrains' MuLan implementation once it is trained.
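
As a rough illustration of this shared latent space, here is a minimal sketch using the laion_clap package and its default pretrained checkpoint (this is not the exact wiring used in this repo, and the audio file name is a placeholder):

import numpy as np
import laion_clap

clap = laion_clap.CLAP_Module(enable_fusion=False)
clap.load_ckpt()  # downloads a default pretrained checkpoint

# both towers project into the same 512-dimensional space
audio_embed = clap.get_audio_embedding_from_filelist(x=["some_clip.wav"], use_tensor=False)
text_embed = clap.get_text_embedding(["a calming violin melody"], use_tensor=False)

# cosine similarity between the prompt and the clip
sim = np.dot(audio_embed[0], text_embed[0]) / (
    np.linalg.norm(audio_embed[0]) * np.linalg.norm(text_embed[0]))
print(audio_embed.shape, text_embed.shape, sim)  # (1, 512) (1, 512) <similarity score>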

Why Encodec?

SoundStream and Encodec are both neural audio codecs that encode a waveform into a sequence of acoustic tokens, which can then be decoded back into a waveform resembling the original. These intermediate tokens can be modeled as a seq2seq task. Encodec was released by Facebook with publicly available pretrained checkpoints, whereas this is not the case for SoundStream.
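
For reference, here is a minimal sketch of encoding a waveform into acoustic tokens and decoding it back with the encodec package. The audio path, sample rate, and bandwidth below are illustrative and not necessarily what this repo uses:

import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # controls how many quantizer levels are used

wav, sr = torchaudio.load("some_clip.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)
    reconstruction = model.decode(encoded_frames)  # waveform resembling the original

codes = torch.cat([codebook for codebook, _ in encoded_frames], dim=-1)  # [batch, n_q, time] acoustic tokens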

Differences from @lucidrains' implementation

  • Autoregressively models the CLAP/MuLan conditioning signal by passing it into the transformers as discrete tokens, as mentioned in section 3.1 of the paper. musiclm-pytorch instead conditions on it with cross-attention.
  • TokenConditionedTransformer can support variable token sequences, which makes it easy to do further experimentation (e.g. combining multiple conditioning signals, stereo waveform generation, etc.)
  • Uses existing open source models instead of training MuLan and SoundStream.
  • Some modifications to increase the chance of successfully training the model.

End Goal

The goal of this project is to replicate the results of MusicLM as quickly as possible without necessarily sticking to the architecture in the paper. For those looking for a more true-to-form implementation, check out musiclm-pytorch.

We also seek to gain a better understanding of CLAP's latent space.

Join us on Discord if you'd like to get involved!

Usage

Install

conda env create -f environment.yaml
conda activate open-musiclm

Configs

A "model config" contains information about the model architecture such as the number of layers, number of quantizers, target audio lengths for each stage, etc. It is used to instantiate the model during training and inference.

A "training config" contains hyperparameters for training the model. It is used to instantiate the trainer classes during training.

See the ./configs directory for example configs.

Training

CLAP RVQ

The first step is to train the residual vector quantizer that maps continuous CLAP embeds to a discrete token sequence.

python ./scripts/train_clap_rvq.py \
    --results_folder ./results/clap_rvq \ # where to save results and checkpoints
    --model_config ./configs/model/musiclm_small.json \ # path to model config
    --training_config ./configs/training/train_musiclm_fma.json # path to training config
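
Conceptually, this step fits a residual VQ on top of CLAP embeddings, roughly as in the sketch below (using vector-quantize-pytorch; the number of quantizers and codebook size are placeholders, with the real values coming from the model config):

import torch
from vector_quantize_pytorch import ResidualVQ

rvq = ResidualVQ(
    dim=512,             # CLAP embedding dimension
    num_quantizers=12,   # placeholder; see the model config
    codebook_size=1024,  # placeholder; see the model config
    kmeans_init=True,
)

clap_embeds = torch.randn(8, 1, 512)  # (batch, seq, dim); stand-in for real CLAP embeddings
quantized, indices, commit_loss = rvq(clap_embeds)
# indices has shape (8, 1, 12): a short discrete token sequence per embedding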

Hubert K-means

Next, we learn a K-means layer that we use to quantize our MERT embeddings into semantic tokens.

python ./scripts/train_hubert_kmeans.py \
    --results_folder ./results/hubert_kmeans \ # where to save results and checkpoints
    --model_config ./configs/model/musiclm_small.json \
    --training_config ./configs/training/train_musiclm_fma.json
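
The K-means step itself is standard clustering over MERT frame embeddings, roughly as sketched below with scikit-learn (the feature dimension and number of clusters are placeholders; the real values come from the model config):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# stand-in for MERT frame embeddings gathered over the dataset
mert_features = np.random.randn(100_000, 768).astype(np.float32)  # (num_frames, feature_dim)

kmeans = MiniBatchKMeans(n_clusters=1024, batch_size=10_000)
kmeans.fit(mert_features)

# at training/inference time, each frame maps to its nearest centroid id, which becomes a semantic token
semantic_tokens = kmeans.predict(mert_features[:100])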

Semantic Stage + Coarse Stage + Fine Stage

Once we have a working K-means model and RVQ, we can train the semantic, coarse, and fine stages. These stages can be trained concurrently.

python ./scripts/train_semantic_stage.py \
    --results_folder ./results/semantic \ # where to save results and checkpoints
    --model_config ./configs/model/musiclm_small.json \
    --training_config ./configs/training/train_musiclm_fma.json \
    --rvq_path PATH_TO_RVQ_CHECKPOINT \ # path to previously trained rvq
    --kmeans_path PATH_TO_KMEANS_CHECKPOINT # path to previously trained kmeans
python ./scripts/train_coarse_stage.py \
    --results_folder ./results/coarse \ # where to save results and checkpoints
    --model_config ./configs/model/musiclm_small.json \
    --training_config ./configs/training/train_musiclm_fma.json \
    --rvq_path PATH_TO_RVQ_CHECKPOINT \ # path to previously trained rvq
    --kmeans_path PATH_TO_KMEANS_CHECKPOINT # path to previously trained kmeans
python ./scripts/train_fine_stage.py \
    --results_folder ./results/fine \ # where to save results and checkpoints
    --model_config ./configs/model/musiclm_small.json \
    --training_config ./configs/training/train_musiclm_fma.json \
    --rvq_path PATH_TO_RVQ_CHECKPOINT \ # path to previously trained rvq
    --kmeans_path PATH_TO_KMEANS_CHECKPOINT # path to previously trained kmeans
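
Roughly speaking, each stage models one flat token sequence: conditioning tokens first, followed by the tokens that the stage learns to predict. The sketch below is only illustrative, with made-up lengths; the exact ordering and sequence lengths are defined by the model config and TokenConditionedTransformer:

import torch

clap_tokens     = torch.randint(0, 1024, (1, 12))   # RVQ indices of the CLAP embedding
semantic_tokens = torch.randint(0, 1024, (1, 250))  # K-means ids of MERT frames
coarse_tokens   = torch.randint(0, 1024, (1, 300))  # first few Encodec quantizer levels
fine_tokens     = torch.randint(0, 1024, (1, 600))  # remaining Encodec quantizer levels

semantic_seq = torch.cat([clap_tokens, semantic_tokens], dim=-1)
coarse_seq   = torch.cat([clap_tokens, semantic_tokens, coarse_tokens], dim=-1)
fine_seq     = torch.cat([clap_tokens, coarse_tokens, fine_tokens], dim=-1)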

Preprocessing

In the above case, we are using CLAP, MERT, and Encodec to generate the CLAP, semantic, and acoustic tokens on the fly during training. However, these models take up space on the GPU, and it is inefficient to recompute the tokens when making multiple runs on the same data. Instead, we can compute the tokens ahead of time and iterate over them during training.

To do this, fill in the data_preprocessor_cfg field in the config and set use_preprocessed_data to True in the trainer configs (look at train_fma_preprocess.json for inspiration). Then run the following to preprocess the dataset, followed by your training script.

python ./scripts/preprocess_data.py \
    --model_config ./configs/model/musiclm_small.json \
    --training_config ./configs/training/train_fma_preprocess.json \
    --rvq_path PATH_TO_RVQ_CHECKPOINT \ # path to previously trained rvq
    --kmeans_path PATH_TO_KMEANS_CHECKPOINT # path to previously trained kmeans

Inference

Generate multiple samples and use CLAP to select the best ones:

python scripts/infer_top_match.py \
    "your text prompt"
    --num_samples 4                                 # number of samples to generate
    --num_top_matches 1                             # number of top matches to return
    --semantic_path PATH_TO_SEMANTIC_CHECKPOINT \   # path to previously trained semantic stage
    --coarse_path PATH_TO_COARSE_CHECKPOINT \       # path to previously trained coarse stage
    --fine_path PATH_TO_FINE_CHECKPOINT \           # path to previously trained fine stage
    --rvq_path PATH_TO_RVQ_CHECKPOINT \             # path to previously trained rvq
    --kmeans_path PATH_TO_KMEANS_CHECKPOINT         # path to previously trained kmeans
    --model_config ./configs/model/musiclm_small.json \
    --duration 4
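
The top-match selection boils down to ranking the generated clips by their CLAP similarity to the prompt, along these lines (a sketch with stand-in embeddings; infer_top_match.py handles the actual generation and embedding):

import torch
import torch.nn.functional as F

text_embed    = torch.randn(512)     # CLAP embedding of the text prompt (stand-in)
sample_embeds = torch.randn(4, 512)  # CLAP embeddings of num_samples generated clips (stand-in)

similarity = F.cosine_similarity(sample_embeds, text_embed.unsqueeze(0), dim=-1)
best = similarity.topk(k=1).indices  # keep the num_top_matches most prompt-like clips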

Generate samples for various test prompts:

python scripts/infer.py \
    --semantic_path PATH_TO_SEMANTIC_CHECKPOINT \   # path to previously trained semantic stage
    --coarse_path PATH_TO_COARSE_CHECKPOINT \       # path to previously trained coarse stage
    --fine_path PATH_TO_FINE_CHECKPOINT \           # path to previously trained fine stage
    --rvq_path PATH_TO_RVQ_CHECKPOINT \             # path to previously trained rvq
    --kmeans_path PATH_TO_KMEANS_CHECKPOINT \       # path to previously trained kmeans
    --model_config ./configs/model/musiclm_small.json \
    --duration 4

You can use the --return_coarse_wave flag to skip the fine stage and reconstruct audio from coarse tokens alone.

Checkpoints

You can download experimental checkpoints for the musiclm_large_small_context model here. To fine-tune the model, call the train scripts with the --fine_tune_from flag.

Thank you

Citations

@inproceedings{Agostinelli2023MusicLMGM,
    title     = {MusicLM: Generating Music From Text},
    author    = {Andrea Agostinelli and Timo I. Denk and Zal{\'a}n Borsos and Jesse Engel and Mauro Verzetti and Antoine Caillon and Qingqing Huang and Aren Jansen and Adam Roberts and Marco Tagliasacchi and Matthew Sharifi and Neil Zeghidour and C. Frank},
    year      = {2023}
}
@article{wu2022large,
  title     = {Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
  author    = {Wu, Yusong and Chen, Ke and Zhang, Tianyu and Hui, Yuchen and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
  journal   = {arXiv preprint arXiv:2211.06687},
  year      = {2022},
}
@article{defossez2022highfi,
  title     = {High Fidelity Neural Audio Compression},
  author    = {Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  journal   = {arXiv preprint arXiv:2210.13438},
  year      = {2022}
}
@misc{li2023mert,
  title     = {MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training}, 
  author    = {Yizhi Li and Ruibin Yuan and Ge Zhang and Yinghao Ma and Xingran Chen and Hanzhi Yin and Chenghua Lin and Anton Ragni and Emmanouil Benetos and Norbert Gyenge and Roger Dannenberg and Ruibo Liu and Wenhu Chen and Gus Xia and Yemin Shi and Wenhao Huang and Yike Guo and Jie Fu},
  year      = {2023},
  eprint    = {2306.00107},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD}
}

open-musiclm's People

Contributors

jlalmes, zhvng


open-musiclm's Issues

Is there too little data for clustering?

It's only about 80 hours of audio; I'm not sure if the actual training configuration was changed.

"hubert_kmeans_trainer_cfg": {
"folder": "./data/fma_large",
"feature_extraction_num_steps": 320,
"feature_extraction_batch_size": 32
},

Pretrained models

Hi there, I want to start off by saying amazing work replicating MusicLM so quickly!

I'm trying to get this to run on my local device, but it seems like that involves downloading the ~100 GB fma_large dataset, followed by training all the various stages.
Would it be possible for you to upload the pretrained models in the results folders so everyone can try this out quickly?

empty data_processor

Hi,

For some reason, the data processor returns empty in preprocess.py:
inputs = next(self.dl_iter)

inputs is empty; however, files.append(file) in data.py works fine, and all files in data/fma_large are appended correctly.

Thank you for your help.

Can you share the CLAP RVQ checkpoints?

Hi,
This is amazing work. Thanks for sharing your code. I want to ask whether you can share the checkpoint for the CLAP RVQ model; I'd like to use it for follow-up work.

loading clap checkpoint

Hi, thank you for the amazing work.

I tried to use the CLAP checkpoint music_speech_audioset_epoch_15_esc_89.98.pt.
Solved by using HTSAT-base.

conda env create -f environment.yaml fails in fresh conda install

I'm using Windows 10 with a fresh Anaconda install.

conda env create -f environment.yaml fails, but the fix is simple: change "sklearn" to "scikit-learn" in the yaml file.

c:\projects\open-musiclm>conda env create -f environment.yaml
Collecting package metadata (repodata.json): done
Solving environment: done

Downloading and Extracting Packages

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Installing pip dependencies: | Ran pip subprocess with arguments:
['C:\Users\Michael\anaconda3\envs\open-musiclm\python.exe', '-m', 'pip', 'install', '-U', '-r', 'c:\projects\open-musiclm\condaenv.388f_48u.requirements.txt', '--exists-action=b']
Pip subprocess output:
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==2.0.0+cu117
Downloading https://download.pytorch.org/whl/cu117/torch-2.0.0%2Bcu117-cp310-cp310-win_amd64.whl (2343.6 MB)
---------------------------------------- 2.3/2.3 GB ? eta 0:00:00
Collecting torchvision==0.15.1+cu117
Downloading https://download.pytorch.org/whl/cu117/torchvision-0.15.1%2Bcu117-cp310-cp310-win_amd64.whl (4.9 MB)
---------------------------------------- 4.9/4.9 MB 7.3 MB/s eta 0:00:00
Collecting torchaudio==2.0.1+cu117
Downloading https://download.pytorch.org/whl/cu117/torchaudio-2.0.1%2Bcu117-cp310-cp310-win_amd64.whl (2.5 MB)
---------------------------------------- 2.5/2.5 MB 17.4 MB/s eta 0:00:00
Collecting einops>=0.4
Downloading einops-0.6.1-py3-none-any.whl (42 kB)
-------------------------------------- 42.2/42.2 kB 146.8 kB/s eta 0:00:00
Collecting vector-quantize-pytorch>=0.10.15
Downloading vector_quantize_pytorch-1.4.1-py3-none-any.whl (11 kB)
Collecting librosa==0.10.0
Downloading librosa-0.10.0-py3-none-any.whl (252 kB)
-------------------------------------- 252.9/252.9 kB 1.7 MB/s eta 0:00:00
Collecting torchlibrosa==0.1.0
Downloading torchlibrosa-0.1.0-py3-none-any.whl (11 kB)
Collecting ftfy
Using cached ftfy-6.1.1-py3-none-any.whl (53 kB)
Collecting tqdm
Using cached tqdm-4.65.0-py3-none-any.whl (77 kB)
Collecting transformers
Downloading transformers-4.29.1-py3-none-any.whl (7.1 MB)
---------------------------------------- 7.1/7.1 MB 16.1 MB/s eta 0:00:00
Collecting encodec==0.1.1
Downloading encodec-0.1.1.tar.gz (3.7 MB)
---------------------------------------- 3.7/3.7 MB 11.4 MB/s eta 0:00:00
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'done'
Collecting gdown
Using cached gdown-4.7.1-py3-none-any.whl (15 kB)
Collecting accelerate>=0.17.0
Downloading accelerate-0.19.0-py3-none-any.whl (219 kB)
-------------------------------------- 219.1/219.1 kB 3.3 MB/s eta 0:00:00
Collecting beartype
Downloading beartype-0.14.0-py3-none-any.whl (720 kB)
-------------------------------------- 720.2/720.2 kB 9.1 MB/s eta 0:00:00
Collecting joblib
Downloading joblib-1.2.0-py3-none-any.whl (297 kB)
-------------------------------------- 298.0/298.0 kB 3.7 MB/s eta 0:00:00
Collecting h5py
Downloading h5py-3.8.0-cp310-cp310-win_amd64.whl (2.6 MB)
---------------------------------------- 2.6/2.6 MB 16.8 MB/s eta 0:00:00
Collecting sklearn
Downloading sklearn-0.0.post5.tar.gz (3.7 kB)
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'error'

Pip subprocess error:
/ error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [18 lines of output]
The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
rather than 'sklearn' for pip commands.

  Here is how to fix this error in the main use cases:
  - use 'pip install scikit-learn' rather than 'pip install sklearn'
  - replace 'sklearn' by 'scikit-learn' in your pip requirements files
    (requirements.txt, setup.py, setup.cfg, Pipfile, etc ...)
  - if the 'sklearn' package is used by one of your dependencies,
    it would be great if you take some time to track which package uses
    'sklearn' instead of 'scikit-learn' and report it to their issue tracker
  - as a last resort, set the environment variable
    SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True to avoid this error

  More information is available at
  https://github.com/scikit-learn/sklearn-pypi-package

  If the previous advice does not cover your use case, feel free to report it at
  https://github.com/scikit-learn/sklearn-pypi-package/issues/new
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

failed

CondaEnvException: Pip failed

Hubert args normalization

I looked at the MERT example and noticed that it actually preprocesses the input. The code looks really convoluted, but in the case of batch size 1 the net effect is that it normalizes the waveform to zero mean and unit variance instead of passing it in directly.

Note that you can use the processor provided in the example if you want, but I decided not to for my case because my wav_input is already on CUDA and transformers forces everything into numpy (and therefore onto the CPU) 😞, resulting in expensive copies as you move data back and forth. Wasn't sure which one you've got here :))
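
For reference, the per-example normalization described above amounts to something like the snippet below (a sketch only; whether it matches the repo's current preprocessing should be double-checked):

import torch

# stand-in batch of waveforms; the point is to normalize on the device they already live on,
# instead of round-tripping through the transformers processor (numpy / CPU)
wav_input = torch.randn(1, 24000 * 10)
wav_input = (wav_input - wav_input.mean(dim=-1, keepdim=True)) / (
    wav_input.std(dim=-1, keepdim=True) + 1e-7)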

Train loss of semantic stage has a problem

When training the semantic stage, the loss becomes very large after more than 2000 steps. I don't know what is causing this problem; do I need to adjust my training strategy?

Error when trying to train CLAP

$ python train_clap_rvq.py

found ignored file, skipping
found ignored file, skipping
found ignored file, skipping
found ignored file, skipping
found ignored file, skipping
found ignored file, skipping
training with dataset of 7594 samples and validating with randomly splitted 400 samples
Traceback (most recent call last):
  File "/Users/akhiltolani/Desktop/open-musiclm-main/scripts/train_clap_rvq.py", line 37, in <module>
    trainer.train()
  File "/Users/akhiltolani/Desktop/open-musiclm-main/scripts/../open_musiclm/trainer.py", line 507, in train
    logs = self.train_step()
  File "/Users/akhiltolani/Desktop/open-musiclm-main/scripts/../open_musiclm/trainer.py", line 473, in train_step
    raw_wave_for_clap = next(self.dl_iter)[0]
IndexError: tuple index out of range

RuntimeError: Error(s) in loading state_dict for TokenConditionedTransformer

I just ran the infer file and got this error:

Traceback (most recent call last):
  File "/workspace/OPEN-MUSICLM/scripts/infer.py", line 66, in <module>
    musiclm = create_musiclm_from_config(
  File "/workspace/OPEN-MUSICLM/scripts/../open_musiclm/config.py", line 442, in create_musiclm_from_config
    semantic_transformer = create_semantic_transformer_from_config(model_config, semantic_path, device)
  File "/workspace/OPEN-MUSICLM/scripts/../open_musiclm/config.py", line 258, in create_semantic_transformer_from_config
    load_model(transformer, checkpoint_path)
  File "/workspace/OPEN-MUSICLM/scripts/../open_musiclm/config.py", line 204, in load_model
    model.load_state_dict(pkg)
  File "/home/user/miniconda/envs/open-musiclm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TokenConditionedTransformer:
	Unexpected key(s) in state_dict: "transformer.layers.6.0.q_scale", "transformer.layers.6.0.k_scale", "transformer.layers.6.0.norm.gamma", "transformer.layers.6.0.norm.beta", "transformer.layers.6.0.to_q.weight", "transformer.layers.6.0.to_kv.weight", "transformer.layers.6.0.to_out.0.weight", "transformer.layers.6.2.0.gamma", "transformer.layers.6.2.0.beta", "transformer.layers.6.2.1.weight", "transformer.layers.6.2.2.ds_conv.weight", "transformer.layers.6.2.4.gamma", "transformer.layers.6.2.4.beta", "transformer.layers.6.2.6.weight", "transformer.layers.7.0.q_scale", "transformer.layers.7.0.k_scale", "transformer.layers.7.0.norm.gamma", "transformer.layers.7.0.norm.beta", "transformer.layers.7.0.to_q.weight", "transformer.layers.7.0.to_kv.weight", "transformer.layers.7.0.to_out.0.weight", "transformer.layers.7.2.0.gamma", "transformer.layers.7.2.0.beta", "transformer.layers.7.2.1.weight", "transformer.layers.7.2.2.ds_conv.weight", "transformer.layers.7.2.4.gamma", "transformer.layers.7.2.4.beta", "transformer.layers.7.2.6.weight", "transformer.layers.8.0.q_scale", "transformer.layers.8.0.k_scale", "transformer.layers.8.0.norm.gamma", "transformer.layers.8.0.norm.beta", "transformer.layers.8.0.to_q.weight", "transformer.layers.8.0.to_kv.weight", "transformer.layers.8.0.to_out.0.weight", "transformer.layers.8.2.0.gamma", "transformer.layers.8.2.0.beta", "transformer.layers.8.2.1.weight", "transformer.layers.8.2.2.ds_conv.weight", "transformer.layers.8.2.4.gamma", "transformer.layers.8.2.4.beta", "transformer.layers.8.2.6.weight", "transformer.layers.9.0.q_scale", "transformer.layers.9.0.k_scale", "transformer.layers.9.0.norm.gamma", "transformer.layers.9.0.norm.beta", "transformer.layers.9.0.to_q.weight", "transformer.layers.9.0.to_kv.weight", "transformer.layers.9.0.to_out.0.weight", "transformer.layers.9.2.0.gamma", "transformer.layers.9.2.0.beta", "transformer.layers.9.2.1.weight", "transformer.layers.9.2.2.ds_conv.weight", "transformer.layers.9.2.4.gamma", "transformer.layers.9.2.4.beta", "transformer.layers.9.2.6.weight", "transformer.layers.10.0.q_scale", "transformer.layers.10.0.k_scale", "transformer.layers.10.0.norm.gamma", "transformer.layers.10.0.norm.beta", "transformer.layers.10.0.to_q.weight", "transformer.layers.10.0.to_kv.weight", "transformer.layers.10.0.to_out.0.weight", "transformer.layers.10.2.0.gamma", "transformer.layers.10.2.0.beta", "transformer.layers.10.2.1.weight", "transformer.layers.10.2.2.ds_conv.weight", "transformer.layers.10.2.4.gamma", "transformer.layers.10.2.4.beta", "transformer.layers.10.2.6.weight", "transformer.layers.11.0.q_scale", "transformer.layers.11.0.k_scale", "transformer.layers.11.0.norm.gamma", "transformer.layers.11.0.norm.beta", "transformer.layers.11.0.to_q.weight", "transformer.layers.11.0.to_kv.weight", "transformer.layers.11.0.to_out.0.weight", "transformer.layers.11.2.0.gamma", "transformer.layers.11.2.0.beta", "transformer.layers.11.2.1.weight", "transformer.layers.11.2.2.ds_conv.weight", "transformer.layers.11.2.4.gamma", "transformer.layers.11.2.4.beta", "transformer.layers.11.2.6.weight", "transformer.layers.12.0.q_scale", "transformer.layers.12.0.k_scale", "transformer.layers.12.0.norm.gamma", "transformer.layers.12.0.norm.beta", "transformer.layers.12.0.to_q.weight", "transformer.layers.12.0.to_kv.weight", "transformer.layers.12.0.to_out.0.weight", "transformer.layers.12.2.0.gamma", "transformer.layers.12.2.0.beta", "transformer.layers.12.2.1.weight", "transformer.layers.12.2.2.ds_conv.weight", "transformer.layers.12.2.4.gamma", 
"transformer.layers.12.2.4.beta", "transformer.layers.12.2.6.weight", "transformer.layers.13.0.q_scale", "transformer.layers.13.0.k_scale", "transformer.layers.13.0.norm.gamma", "transformer.layers.13.0.norm.beta", "transformer.layers.13.0.to_q.weight", "transformer.layers.13.0.to_kv.weight", "transformer.layers.13.0.to_out.0.weight", "transformer.layers.13.2.0.gamma", "transformer.layers.13.2.0.beta", "transformer.layers.13.2.1.weight", "transformer.layers.13.2.2.ds_conv.weight", "transformer.layers.13.2.4.gamma", "transformer.layers.13.2.4.beta", "transformer.layers.13.2.6.weight", "transformer.layers.14.0.q_scale", "transformer.layers.14.0.k_scale", "transformer.layers.14.0.norm.gamma", "transformer.layers.14.0.norm.beta", "transformer.layers.14.0.to_q.weight", "transformer.layers.14.0.to_kv.weight", "transformer.layers.14.0.to_out.0.weight", "transformer.layers.14.2.0.gamma", "transformer.layers.14.2.0.beta", "transformer.layers.14.2.1.weight", "transformer.layers.14.2.2.ds_conv.weight", "transformer.layers.14.2.4.gamma", "transformer.layers.14.2.4.beta", "transformer.layers.14.2.6.weight", "transformer.layers.15.0.q_scale", "transformer.layers.15.0.k_scale", "transformer.layers.15.0.norm.gamma", "transformer.layers.15.0.norm.beta", "transformer.layers.15.0.to_q.weight", "transformer.layers.15.0.to_kv.weight", "transformer.layers.15.0.to_out.0.weight", "transformer.layers.15.2.0.gamma", "transformer.layers.15.2.0.beta", "transformer.layers.15.2.1.weight", "transformer.layers.15.2.2.ds_conv.weight", "transformer.layers.15.2.4.gamma", "transformer.layers.15.2.4.beta", "transformer.layers.15.2.6.weight", "transformer.layers.16.0.q_scale", "transformer.layers.16.0.k_scale", "transformer.layers.16.0.norm.gamma", "transformer.layers.16.0.norm.beta", "transformer.layers.16.0.to_q.weight", "transformer.layers.16.0.to_kv.weight", "transformer.layers.16.0.to_out.0.weight", "transformer.layers.16.2.0.gamma", "transformer.layers.16.2.0.beta", "transformer.layers.16.2.1.weight", "transformer.layers.16.2.2.ds_conv.weight", "transformer.layers.16.2.4.gamma", "transformer.layers.16.2.4.beta", "transformer.layers.16.2.6.weight", "transformer.layers.17.0.q_scale", "transformer.layers.17.0.k_scale", "transformer.layers.17.0.norm.gamma", "transformer.layers.17.0.norm.beta", "transformer.layers.17.0.to_q.weight", "transformer.layers.17.0.to_kv.weight", "transformer.layers.17.0.to_out.0.weight", "transformer.layers.17.2.0.gamma", "transformer.layers.17.2.0.beta", "transformer.layers.17.2.1.weight", "transformer.layers.17.2.2.ds_conv.weight", "transformer.layers.17.2.4.gamma", "transformer.layers.17.2.4.beta", "transformer.layers.17.2.6.weight", "transformer.layers.18.0.q_scale", "transformer.layers.18.0.k_scale", "transformer.layers.18.0.norm.gamma", "transformer.layers.18.0.norm.beta", "transformer.layers.18.0.to_q.weight", "transformer.layers.18.0.to_kv.weight", "transformer.layers.18.0.to_out.0.weight", "transformer.layers.18.2.0.gamma", "transformer.layers.18.2.0.beta", "transformer.layers.18.2.1.weight", "transformer.layers.18.2.2.ds_conv.weight", "transformer.layers.18.2.4.gamma", "transformer.layers.18.2.4.beta", "transformer.layers.18.2.6.weight", "transformer.layers.19.0.q_scale", "transformer.layers.19.0.k_scale", "transformer.layers.19.0.norm.gamma", "transformer.layers.19.0.norm.beta", "transformer.layers.19.0.to_q.weight", "transformer.layers.19.0.to_kv.weight", "transformer.layers.19.0.to_out.0.weight", "transformer.layers.19.2.0.gamma", "transformer.layers.19.2.0.beta", 
"transformer.layers.19.2.1.weight", "transformer.layers.19.2.2.ds_conv.weight", "transformer.layers.19.2.4.gamma", "transformer.layers.19.2.4.beta", "transformer.layers.19.2.6.weight", "transformer.layers.20.0.q_scale", "transformer.layers.20.0.k_scale", "transformer.layers.20.0.norm.gamma", "transformer.layers.20.0.norm.beta", "transformer.layers.20.0.to_q.weight", "transformer.layers.20.0.to_kv.weight", "transformer.layers.20.0.to_out.0.weight", "transformer.layers.20.2.0.gamma", "transformer.layers.20.2.0.beta", "transformer.layers.20.2.1.weight", "transformer.layers.20.2.2.ds_conv.weight", "transformer.layers.20.2.4.gamma", "transformer.layers.20.2.4.beta", "transformer.layers.20.2.6.weight", "transformer.layers.21.0.q_scale", "transformer.layers.21.0.k_scale", "transformer.layers.21.0.norm.gamma", "transformer.layers.21.0.norm.beta", "transformer.layers.21.0.to_q.weight", "transformer.layers.21.0.to_kv.weight", "transformer.layers.21.0.to_out.0.weight", "transformer.layers.21.2.0.gamma", "transformer.layers.21.2.0.beta", "transformer.layers.21.2.1.weight", "transformer.layers.21.2.2.ds_conv.weight", "transformer.layers.21.2.4.gamma", "transformer.layers.21.2.4.beta", "transformer.layers.21.2.6.weight", "transformer.layers.22.0.q_scale", "transformer.layers.22.0.k_scale", "transformer.layers.22.0.norm.gamma", "transformer.layers.22.0.norm.beta", "transformer.layers.22.0.to_q.weight", "transformer.layers.22.0.to_kv.weight", "transformer.layers.22.0.to_out.0.weight", "transformer.layers.22.2.0.gamma", "transformer.layers.22.2.0.beta", "transformer.layers.22.2.1.weight", "transformer.layers.22.2.2.ds_conv.weight", "transformer.layers.22.2.4.gamma", "transformer.layers.22.2.4.beta", "transformer.layers.22.2.6.weight", "transformer.layers.23.0.q_scale", "transformer.layers.23.0.k_scale", "transformer.layers.23.0.norm.gamma", "transformer.layers.23.0.norm.beta", "transformer.layers.23.0.to_q.weight", "transformer.layers.23.0.to_kv.weight", "transformer.layers.23.0.to_out.0.weight", "transformer.layers.23.2.0.gamma", "transformer.layers.23.2.0.beta", "transformer.layers.23.2.1.weight", "transformer.layers.23.2.2.ds_conv.weight", "transformer.layers.23.2.4.gamma", "transformer.layers.23.2.4.beta", "transformer.layers.23.2.6.weight". 
	size mismatch for transformer.layers.0.0.to_q.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([512, 1024]).
	size mismatch for transformer.layers.0.0.to_out.0.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([1024, 512]).
	size mismatch for transformer.layers.1.0.to_q.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([512, 1024]).
	size mismatch for transformer.layers.1.0.to_out.0.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([1024, 512]).
	size mismatch for transformer.layers.2.0.to_q.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([512, 1024]).
	size mismatch for transformer.layers.2.0.to_out.0.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([1024, 512]).
	size mismatch for transformer.layers.3.0.to_q.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([512, 1024]).
	size mismatch for transformer.layers.3.0.to_out.0.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([1024, 512]).
	size mismatch for transformer.layers.4.0.to_q.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([512, 1024]).
	size mismatch for transformer.layers.4.0.to_out.0.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([1024, 512]).
	size mismatch for transformer.layers.5.0.to_q.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([512, 1024]).
	size mismatch for transformer.layers.5.0.to_out.0.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([1024, 512]).
	size mismatch for transformer.rel_pos_bias.net.0.0.weight: copying a param with shape torch.Size([512, 1]) from checkpoint, the shape in current model is torch.Size([1024, 1]).
	size mismatch for transformer.rel_pos_bias.net.0.0.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for transformer.rel_pos_bias.net.1.0.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for transformer.rel_pos_bias.net.1.0.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for transformer.rel_pos_bias.net.2.0.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for transformer.rel_pos_bias.net.2.0.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for transformer.rel_pos_bias.net.3.weight: copying a param with shape torch.Size([16, 512]) from checkpoint, the shape in current model is torch.Size([8

can't train the clap

(open-musiclm) G:\Learn\AmateurLearning\AI\Practice\open-musiclm-main>python ./scripts/train_clap_rvq.py --results_folder ./results/clap_rvq --model_config ./configs/model/musiclm_small.json --training_config ./configs/training/train_musiclm_fma.json
loading clap...
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias']

  • This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    F:\Miniconda\envs\open-musiclm\Lib\site-packages\torchaudio\transforms_transforms.py:611: UserWarning: Argument 'onesided' has been deprecated and has no influence on the behavior of this module.
    warnings.warn(
    F:\Miniconda\envs\open-musiclm\Lib\site-packages\accelerate\accelerator.py:258: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
    warnings.warn(
    F:\Miniconda\envs\open-musiclm\Lib\site-packages\accelerate\accelerator.py:375: UserWarning: log_with=tensorboard was passed but no supported trackers are currently installed.
    warnings.warn(f"log_with={log_with} was passed but no supported trackers are currently installed.")
    Traceback (most recent call last):
    File "G:\Learn\AmateurLearning\AI\Practice\open-musiclm-main\scripts\train_clap_rvq.py", line 33, in
    trainer = create_clap_rvq_trainer_from_config(
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "G:\Learn\AmateurLearning\AI\Practice\open-musiclm-main\scripts..\open_musiclm\config.py", line 317, in create_clap_rvq_trainer_from_config
    trainer = ClapRVQTrainer(
    ^^^^^^^^^^^^^^^
    File "G:\Learn\AmateurLearning\AI\Practice\open-musiclm-main\scripts..\open_musiclm\trainer.py", line 606, in init
    self.ds = SoundDataset(
    ^^^^^^^^^^^^^
    File "G:\Learn\AmateurLearning\AI\Practice\open-musiclm-main\scripts..\open_musiclm\data.py", line 80, in init
    assert path.exists(), 'folder does not exist'
    AssertionError: folder does not exist @zhvng @jlalmes

Models working, but audio quality is not good.

Hi Zhvng, I've been able to train all the models, generate the checkpoints and run infer.py. The library is able to output generated audio files now, and it has some "hints" of music, but it sounds like one of the models is broken.

Do you have any ideas or guidance on what I can do to improve the generated audio quality?

Info about the models I've trained for testing -

  1. clap_rvq - 950 steps
  2. Hubert Kmeans - Completed training. Stopped automatically due to a lack of convergence at 60.
  3. Semantic - 6000 steps
  4. Fine - 4000 steps
  5. Coarse - 4000 steps

These models are trained on the fma_large dataset.

To get better audio quality, should I just continue training the models with more training steps, or are there unsolved technical challenges that need to be resolved before I can start retraining?
I've linked the current state of the audio output below.

https://cassetteai.com/generations/gen_0.wav (chirping of birds and the distant echoes of bells)
https://cassetteai.com/generations/gen_1.wav (cat meowing)

Installation and Basic Usage Guide

Thank you for implementing this.

Not sure what the prerequisites and models are to get started; it would be great to have some more documentation around these topics.

words

does your model generate music with words?

ClapRVQTrainer does not have log_with

ClapRVQTrainer does not have "log_with" and it causes error in following line

if 'tensorboard' in self.log_with:
self.accelerator.init_trackers(f"clap_rvq_{int(time.time() * 1000)}", config=hps)
else:
self.accelerator.init_trackers(f"clap_rvq", config=hps)

So, I added log_with refer to SingleStageTrainer class and it works fine.

self.log_with = accelerate_kwargs['log_with'] if 'log_with' in accelerate_kwargs else None

Ran inference for hours without stopping

When I run infer_top_match.py, it has been running for a few hours without ending. How do I control the length of the generation?

[screenshot omitted]
How do I stop the generation? It has been running for several hours without finishing and has produced no results.

empty db, curious errors, empty output, long gen time, outputs noise

I am writing here because the discord invite in the README.md is invalid.

I am not sure I am doing this "right". Using the dataset provided on Google Drive and the prompt "violins playing Tchaikovsky", it takes 10 minutes on an RTX 4070 Ti to generate tokens and create a 4-second clip of chaotic humming sounds; when I make a 30-second clip, which takes over an hour to generate tokens, it creates a 3 MB file that sounds like car horns under water :/

Is there a preferred prompt to use with the test data? What sounds were sampled to make the test data?

When I tried to sample my own sounds, after 24 hours the semantic encoding was less than 10% finished. Is it "normal" that it should take 10 days to sample a clip?

Also, using the Google Drive data and --model_config ./model/musiclm_large_small_context.json, I get the errors:

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

You are using a model of type mert_model to instantiate a model of type hubert. This is not supported for all configurations of models and can yield errors.

What are the correct settings for using the Google Drive data?

My current command is:

python scripts/infer_top_match.py \
    "violins playing Tchaikovsky" \
    --num_samples 4 \
    --num_top_matches 1 \
    --semantic_path   ./model/semantic.transformer.14000.pt \
    --coarse_path     ./model/coarse.transformer.18000.pt \
    --fine_path       ./model/fine.transformer.24000.pt \
    --rvq_path        ./model/clap.rvq.950_no_fusion.pt \
    --kmeans_path     ./model/kmeans_10s_no_fusion.joblib \
    --model_config    ./model/musiclm_large_small_context.json \
    --duration 4

I had to use the Google Drive data because the code, while not generating any errors, generated a 0-byte preprocessed.db file in the semantic section, which caused errors in the generation section.

Is there a working example of this code somewhere with proper checkpoints?

Thanks

Question regarding dataset

The README states that the CLAP model uses the LAION-Audio-630K dataset; however, in the repo I can only find a reference to the FMA (Free Music Archive) dataset. Is there any specific reason for this?
