Large-Audio-Models

We track major developments in the audio domain, including speech, singing, and music.

Contents

Prompt-based Audio Synthesis

  • SpeechX: Neural Codec Language Model as a Versatile Speech Transformer(2023), Xiaofei Wang et al. [PDF]
  • TANGO: Text-to-Audio Generation using Instruction Tuned LLM and Latent Diffusion Model(2023), Deepanway Ghosal et al. [PDF]
  • Diverse and Vivid Sound Generation from Text Descriptions(2023), Guangwei Li et al. [PDF]
  • NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers(2023), Kai Shen et al. [PDF]
  • AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models(2023), Yuancheng Wang et al. [PDF]
  • Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos(2023), Kun Su et al. [PDF]
  • FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model(2023), Ruiqing Xue et al. [PDF]
  • VALL-E X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling (2023), Ziqiang Zhang et al. [PDF]
  • Simple and Controllable Music Generation(2023), Jade Copet et al. [PDF]
  • Efficient Neural Music Generation(2023), Max W. Y. Lam et al. [PDF]
  • ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models(2023), Pengfei Zhu et al. [PDF]
  • Noise2Music: Text-conditioned Music Generation with Diffusion Models(2023), Qingqing Huang et al. [PDF]
  • Spear-TTS: Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision(2023), Eugene Kharitonov et al. [PDF]
  • SingSong: Generating musical accompaniments from singing(2023), Chris Donahue et al. [PDF]
  • MusicLM: Generating Music From Text(2023), Andrea Agostinelli et al. [PDF]
  • InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt (2023), Dongchao Yang et al. [PDF]
  • Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation(2023), Jiawei Huang et al. [PDF]
  • AudioLDM: Text-to-Audio Generation with Latent Diffusion Models(2023), Haohe Liu et al. [PDF]
  • Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion(2023), Flavio Schneider et al. [PDF]
  • Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models(2023), Rongjie Huang et al. [PDF]
  • ArchiSound: Audio Generation with Diffusion(2023), Flavio Schneider. [PDF]
  • VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (2023), Chengyi Wang et al. [PDF]
  • PromptTTS: Controllable Text-to-Speech with Text Descriptions(2022), Zhifang Guo et al. [PDF]
  • Diffsound: Discrete Diffusion Model for Text-to-sound Generation(2022), Dongchao Yang et al. [PDF]

Audio Language Models

  • SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models(2023), Xin Zhang et al. [PDF]
  • SoundStorm: Efficient Parallel Audio Generation(2023), Zalán Borsos et al. [PDF]
  • AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head(2023), Rongjie Huang et al. [PDF]
  • AudioPaLM: A Large Language Model That Can Speak and Listen(2023), Paul K. Rubenstein et al. [PDF]
  • Pengi: An Audio Language Model for Audio Tasks(2023), Soham Deshmukh et al. [PDF]
  • AudioLM: a Language Modeling Approach to Audio Generation(2022), Zalán Borsos et al. [PDF]

Audio SSL and UL Models

  • vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations(2019), Alexei Baevski et al. [PDF]
  • wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (2020), Alexei Baevski et al. [PDF]
  • W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training (2021), Yu-An Chung et al. [PDF]
  • HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units (2021), Wei-Ning Hsu et al. [PDF]
  • Data2vec: A general framework for self-supervised learning in speech, vision and language (2022), Alexei Baevski et al. [PDF]
  • MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets (2022), Ziyang Ma et al. [PDF]
  • ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers (2022), Kaizhi Qian et al. [PDF]
  • Data2vec 2.0: Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language (2022), Alexei Baevski et al. [PDF]
  • MuLan: A Joint Embedding of Music Audio and Natural Language (2022), Qingqing Huang et al. [PDF]
