Large-Audio-Models

We keep track of large models in the audio domain, covering speech, singing, music, and more.

Prompt-based Audio Synthesis
Audio Language Models
Audio SSL and UL models

Prompt-based Audio Synthesis

SpeechX: Neural Codec Language Model as a Versatile Speech Transformer(2023), Xiaofei Wang et al. [PDF]
TANGO: Text-to-Audio Generation using Instruction Tuned LLM and Latent Diffusion Model(2023), Deepanway Ghosal et al. [PDF]
Diverse and Vivid Sound Generation from Text Descriptions(2023), Guangwei Li et al. [PDF]
NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers(2023), Kai Shen et al. [PDF]
AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models(2023), Yuancheng Wang et al. [PDF]
Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos(2023), Kun Su et al. [PDF]
FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model(2023), Ruiqing Xue et al. [PDF]
VALL-E X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling (2023), Ziqiang Zhang et al. [PDF]
Simple and Controllable Music Generation(2023), Jade Copet et al. [PDF]
Efficient Neural Music Generation(2023), Max W. Y. Lam et al. [PDF]
ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models(2023), Pengfei Zhu et al. [PDF]
Noise2Music: Text-conditioned Music Generation with Diffusion Models(2023), Qingqing Huang et al. [PDF]
Spear-TTS: Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision(2023), Eugene Kharitonov et al. [PDF]
SingSong: Generating musical accompaniments from singing(2023), Chris Donahue et al. [PDF]
MusicLM: Generating Music From Text(2023), Andrea Agostinelli et al. [PDF]
InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt (2023), Dongchao Yang et al. [PDF]
Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation(2023), Jiawei Huang et al. [PDF]
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models(2023), Haohe Liu et al. [PDF]
Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion(2023), Flavio Schneider et al. [PDF]
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models(2023), Rongjie Huang et al. [PDF]
ArchiSound: Audio Generation with Diffusion(2023), Flavio Schneider. [PDF]
VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (2023), Chengyi Wang et al. [PDF]
PromptTTS: Controllable Text-to-Speech with Text Descriptions(2022), Zhifang Guo et al. [PDF]
Diffsound: Discrete Diffusion Model for Text-to-sound Generation(2022), Dongchao Yang et al. [PDF]

Audio Language Models

SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models(2023), Xin Zhang et al. [PDF]
SoundStorm: Efficient Parallel Audio Generation(2023), Zalán Borsos et al. [PDF]
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head(2023), Rongjie Huang et al. [PDF]
AudioPaLM: A Large Language Model That Can Speak and Listen(2023), Paul K. Rubenstein et al. [PDF]
Pengi: An Audio Language Model for Audio Tasks(2023), Soham Deshmukh et al. [PDF]
AudioLM: a Language Modeling Approach to Audio Generation(2022), Zalán Borsos et al. [PDF]

Audio SSL and UL models

vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations(2019), Alexei Baevski et al. [PDF]
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (2020), Alexei Baevski et al. [PDF]
W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training (2021), Yu-An Chung et al. [PDF]
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units (2021), Wei-Ning Hsu et al. [PDF]
Data2vec: A general framework for self-supervised learning in speech, vision and language (2022), Alexei Baevski et al. [PDF]
MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets (2022), Ziyang Ma et al. [PDF]
ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers (2022), Kaizhi Qian et al. [PDF]
Data2vec 2.0: Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language (2022), Alexei Baevski et al. [PDF]
MuLan: A Joint Embedding of Music Audio and Natural Language (2022), Qingqing Huang et al. [PDF]
