AudioLM - Pytorch (wip)

Implementation of AudioLM, a Language Modeling Approach to Audio Generation out of Google Research, in Pytorch

It also extends the work with conditioning via T5 and classifier free guidance. This allows one to do text-to-audio or TTS, which is not offered in the paper.

Install

$ pip install audiolm-pytorch

Usage

First, SoundStream needs to be trained on a large corpus of audio data

from audiolm_pytorch import SoundStream, SoundStreamTrainer

soundstream = SoundStream(
    codebook_size = 1024,
    rq_num_quantizers = 8,
)

trainer = SoundStreamTrainer(
    soundstream,
    folder = '/path/to/librispeech',
    batch_size = 4,
    data_max_length = 320 * 32,
    num_train_steps = 10000
).cuda()

trainer.train()
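
After training, the codec can be sanity checked by round-tripping a short waveform through it. A minimal sketch follows; the return_recons_only keyword is an assumption borrowed from later versions of the library, not a confirmed part of this snapshot's API

import torch

wave = torch.randn(1, 320 * 32).cuda()

recons = soundstream(wave, return_recons_only = True) # assumed flag - return the reconstructed waveform rather than the training loss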

Then three separate transformers (SemanticTransformer, CoarseTransformer, FineTransformer) need to be trained

ex. SemanticTransformer

import torch
from audiolm_pytorch import HubertWithKmeans, SemanticTransformer

wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert/hubert_base_ls960.pt',
    kmeans_path = './hubert/hubert_base_ls960_L9_km500.bin'
)

semantic_transformer = SemanticTransformer(
    wav2vec = wav2vec,
    dim = 1024,
    depth = 6
).cuda()

wave = torch.randn(1, 320 * 512).cuda()

loss = semantic_transformer(
    raw_wave = wave,
    return_loss = True
)

loss.backward()

# after much training above

sample = semantic_transformer.generate(max_length = 128) # (1, < 128) - may terminate early if it detects [eos]
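
As mentioned in the intro, the work is extended with T5 conditioning and classifier free guidance for text-to-audio / TTS. A minimal sketch of how that could look on the SemanticTransformer follows; the has_condition, cond_drop_prob and text keywords are assumptions, not confirmed API

cond_semantic_transformer = SemanticTransformer(
    wav2vec = wav2vec,
    dim = 1024,
    depth = 6,
    has_condition = True,   # assumed flag enabling T5 text conditioning
    cond_drop_prob = 0.5    # assumed probability of dropping the condition during training, for classifier free guidance
).cuda()

loss = cond_semantic_transformer(
    raw_wave = wave,
    text = ['the quick brown fox jumps over the lazy dog'],  # assumed keyword for text prompts
    return_loss = True
)

loss.backward()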

ex. CoarseTransformer

import torch
from audiolm_pytorch import HubertWithKmeans, SoundStream, CoarseTransformer, CoarseTransformerWrapper

wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert/hubert_base_ls960.pt',
    kmeans_path = './hubert/hubert_base_ls960_L9_km500.bin'
)

soundstream = SoundStream(
    codebook_size = 1024,
    rq_num_quantizers = 8,
)

coarse_transformer = CoarseTransformer(
    wav2vec = wav2vec,
    codebook_size = 1024,
    num_coarse_quantizers = 3,
    dim = 512,
    depth = 6
)

coarse_wrapper = CoarseTransformerWrapper(
    wav2vec = wav2vec,
    soundstream = soundstream,
    transformer = coarse_transformer
).cuda()

wave = torch.randn(1, 32 * 320).cuda()

loss = coarse_wrapper(
    raw_wave = wave,
    return_loss = True
)

loss.backward()
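
Note that the coarse stage only models the first few of SoundStream's residual quantizer levels, conditioned on the semantic tokens; the remaining levels are handled by the FineTransformer in the next example. With the numbers used throughout this readme, the two counts should presumably add up to SoundStream's rq_num_quantizers

rq_num_quantizers = 8       # residual VQ levels in the SoundStream examples above
num_coarse_quantizers = 3   # levels modeled by the CoarseTransformer
num_fine_quantizers = 5     # remaining levels modeled by the FineTransformer below

assert num_coarse_quantizers + num_fine_quantizers == rq_num_quantizers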

ex. FineTransformer

import torch
from audiolm_pytorch import SoundStream, FineTransformer, FineTransformerWrapper

soundstream = SoundStream(
    codebook_size = 1024,
    rq_num_quantizers = 8,
)

soundstream.load('/path/to/trained/soundstream.pt')

transformer = FineTransformer(
    num_coarse_quantizers = 3,
    num_fine_quantizers = 5,
    codebook_size = 1024,
    dim = 512,
    depth = 6
)

train_wrapper = FineTransformerWrapper(
    soundstream = soundstream,
    transformer = transformer
).cuda()

wave = torch.randn(1, 320 * 512).cuda()

loss = train_wrapper(
    raw_wave = wave,
    return_loss = True
)

loss.backward()

  • show how to generate from prompt tensor or file
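
A hypothetical sketch of full end-to-end inference, tying the trained pieces together with the AudioLM class referenced in the Todo list below; the constructor keywords and call signature here are assumptions, not confirmed API

from audiolm_pytorch import AudioLM

audiolm = AudioLM(
    wav2vec = wav2vec,                            # assumed keyword names for the trained components
    soundstream = soundstream,
    semantic_transformer = semantic_transformer,
    coarse_transformer = coarse_transformer,
    fine_transformer = fine_transformer
)

generated_wav = audiolm(batch_size = 1)           # assumed to return generated raw waveform(s)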

Appreciation

  • Stability.ai for the generous sponsorship to work and open source cutting edge artificial intelligence research

  • 🤗 Huggingface for their amazing accelerate and transformers libraries

  • MetaAI for Fairseq and the liberal license

  • @eonglints for offering his professional advice and expertise

Todo

  • complete CoarseTransformer

  • use fairseq vq-wav2vec for embeddings

  • add conditioning

  • add classifier free guidance

  • add unique consecutive for removing repeated consecutive semantic tokens

  • incorporate ability to use hubert intermediate features as semantic tokens, recommended by eonglints

  • accommodate variable-length audio, bring in eos token

  • make sure unique consecutive works with coarse transformer

  • pretty print all discriminator losses to the log

  • handle the case when generating semantic tokens where the last logits may not necessarily be the last in the sequence, given unique consecutive processing

  • complete full training code for soundstream, taking care of discriminator training

  • figure out how to do the normalization across each dimension mentioned in the paper, but ignore it for v1 of the framework

  • complete sampling code for both Coarse and Fine Transformers, which will be tricky

  • full transformer training code for all three transformers

  • make sure full inference with or without prompting works on the AudioLM class

  • offer option to weight tie coarse, fine, and semantic embeddings across the 3 hierarchical transformers

  • DRY a little at the end

  • figure out how to suppress logging in fairseq

  • test with speech synthesis for starters

  • abstract out conditioning + classifier free guidance into external module or potentially a package

  • add option to use flash attention

  • simplify training even more within AudioLM class

Citations

@inproceedings{Borsos2022AudioLMAL,
    title   = {AudioLM: a Language Modeling Approach to Audio Generation},
    author  = {Zal{\'a}n Borsos and Rapha{\"e}l Marinier and Damien Vincent and Eugene Kharitonov and Olivier Pietquin and Matthew Sharifi and Olivier Teboul and David Grangier and Marco Tagliasacchi and Neil Zeghidour},
    year    = {2022}
}
@misc{https://doi.org/10.48550/arxiv.2107.03312,
    title   = {SoundStream: An End-to-End Neural Audio Codec},
    author  = {Zeghidour, Neil and Luebs, Alejandro and Omran, Ahmed and Skoglund, Jan and Tagliasacchi, Marco},
    publisher = {arXiv},
    url     = {https://arxiv.org/abs/2107.03312},
    year    = {2021}
}
@misc{shazeer2020glu,
    title   = {GLU Variants Improve Transformer},
    author  = {Noam Shazeer},
    year    = {2020},
    url     = {https://arxiv.org/abs/2002.05202}
}
@article{Shazeer2019FastTD,
    title   = {Fast Transformer Decoding: One Write-Head is All You Need},
    author  = {Noam M. Shazeer},
    journal = {ArXiv},
    year    = {2019},
    volume  = {abs/1911.02150}
}
@article{Ho2022ClassifierFreeDG,
    title   = {Classifier-Free Diffusion Guidance},
    author  = {Jonathan Ho},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2207.12598}
}
@misc{crowson2022,
    author  = {Katherine Crowson},
    url     = {https://twitter.com/rivershavewings}
}
@misc{ding2021cogview,
    title   = {CogView: Mastering Text-to-Image Generation via Transformers},
    author  = {Ming Ding and Zhuoyi Yang and Wenyi Hong and Wendi Zheng and Chang Zhou and Da Yin and Junyang Lin and Xu Zou and Zhou Shao and Hongxia Yang and Jie Tang},
    year    = {2021},
    eprint  = {2105.13290},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
